SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Fault Tolerance
Basic System Concept
Basic Definitions

• Failure: deviation of a system from
behaviour described in its specification.
• Error: part of the state which is incorrect.
• Fault: an error in the internal states of the
components of a system or in the design
of a system.
… and Donald Rumsfeld said:

There are known knowns. These are
things we know that we know. There
are known unknowns. That is to say,
there are things that we know we
don't know. But there are also
unknown unknowns. There are
things we don't know we don't know.
Types of Faults
• Hard faults
– Permanent

Resulting failures are called hard failures
• Soft faults
– Transient or intermittent
– Account for more than 90% of all failures

Resulting failures are called soft failures
Fault Classification
Failure Detection

MTBF: Mean Time Between Failure
MTTD: Mean Time To Discovery
MTTR: Mean Time to Repair
Failure Types
Distributed Algorithms
• Primary focus in Distributed Systems is on a
number of concurrently running processes
• Distributed system is composed of n processes
• A process executes a sequence of events
– Local computation
– Sending a message m
– Receiving a message m

• A distributed algorithm makes use of more than
one process.
Properties of Distributed Algorithms
• Safety
– Means that some particular “bad” thing never
happens.

• Liveness
– Indicates that some particular “good” thing will
(eventually) happen.

Timing/failure assumptions affect how we
reason about these properties and what we
can prove
Timing Model
• Specifies assumptions regarding delays between
– execution steps of a correct process
– send and receipt of a message sent between correct
processes

• Many gradations. Two of interest are:
Synchronous
Known bounds on message
and execution delays.

Asynchronous
No assumptions about
message
and execution delays
(except that they are finite).

• Partial synchrony is more realistic in distributed
system
Synchronous timing assumption
• Processes share a clock
• Timestamps mean something between
processes
– Otherwise processes are synchronised using a
time server

• Communication can be guaranteed to
occur in some number of clock cycles
CSC469

11/30/13
Asynchronous timing assumption
• Processes operate asynchronously from
one another.
• No claims can be made about whether
another process is running slowly or has
failed.
• There is no time bound on how long it takes
for a message to be delivered.
Partial synchrony assumption
• “Timing-based distributed algorithms”
• Processes have some information about
time
– Clocks that are synchronized within some
bound
– Approximate bounds on message-deliver time
– Use of timeouts
Failure Model
• A process that behaves according to its I/O
specification throughout its execution is called
correct
• A process that deviates from its specification is
faulty
• Many gradations of faulty. Two of interest are:

Fail-Stop failures
A faulty process halts
execution prematurely.

Byzantine failures
No assumption about
behavior of a faulty process.
Errors as failure assumptions
• Specific types of errors are listed as failure
assumptions
– Communication link may lose messages
– Link may duplicate messages
– Link may reorder messages
– Process may die and be restarted
Fail-Stop failure
• A failure results in the process, p, stopping
– Also referred to as crash failure
– p works correctly until the point of failure

• p does not send any more messages
• p does not perform actions when
messages are sent to it
• Other processes can detect that p has
failed
Fault/failure detectors
• A perfect failure detector
– No false positives (only reports actual failures).
– Eventually reports failures to all processes.

• Heartbeat protocols
– Assumes partially synchronous environment
– Processes send “I’m Alive” messages to all
other processes regularly
– If process i does not hear from process j in
some time T = Tdelivery + Theartbeat then it determines
that j has failed
– Depends on Tdelivery being known and accurate
Other Failure Models
• Omission failure
• Process fails to send messages, to receive
incoming messages, or to handle incoming
messages

• Timing failure
• process‘s response lies outside specified time
interval

• Response failure
• Value of response is incorrect
Byzantine failure
• Process p fails in an arbitrary manner.
• p is modeled as a malevolent entity
– Can send the messages and perform the
actions that will have the worst impact on other
processes
– Can collaborate with other “failed” processes

• Common constraints
– Incomplete knowledge of global state
– Limited ability to coordinate with other
Byzantine processes
Setup of Distributed Consensus
• N processes have to agree on a single
value.
– Example applications of consensus:
• Performing a commit in a replicated/distributed
database.
• Collecting multiple sensor readings and deciding on
an action

• Each process begins with a value
• Each process can irrevocably decide on a
value
• Up to f < n processes may be faulty
– How do you reach consensus if no failures?
Properties of Distributed Consensus
• Agreement
– If any correct process believes that V is the consensus
value, then all correct processes believe V is the
consensus value.

• Validity
– If V is the consensus value, then some process
proposed V.

• Termination
– Each process decides some value V.

• Agreement and Validity are Safety Properties
• Termination is a Liveness property.
Synchronous Fail-stop Consensus
• FloodSet algorithm run at each process i
– Remember, we want to tolerate up to f
failures
• S is a set of values
• Decide(x) can be
Si  {initial value}
various functions
for k = 1 to f+1
send Si to all processes
receive Sj from all j != i
Si  Si Sj (for all j)
end for
Decide(Si)

• E.g. min(x), max(x),
majority(x), or some
default

• Assumes nodes are
connected and links do
not fail
Analysis of FloodSet
• Requires f+1 rounds because process can
fail at any time, in particular, during send
• Agreement: Since at most f failures, then
after f+1 rounds all correct processes will
evaluate Decide(Si) the same.
• Validity: Decide results in a proposed
value (or default value)
• Termination: After f+1 rounds the algorithm
completes
Example with f = 1, Decide() = min()
1

End of
round 1

End of
round 2

S1 = {0}
2
S2 = {1}

{1}

{0,1}

decide 0

{0,1}

{0,1}

decide 0

3
S3 = {1}
Synchronous/Byzantine Consensus
• Faulty processes can behave arbitrarily
– May actively try to trick other processes

• Algorithm described by Lamport, Shostak, &
Pease in terms of Byzantine generals agreeing
whether to attack or retreat. Simple
requirements:
– All loyal generals decide on the same plan of action
• Implies that all loyal generals obtain the same information

– A small number of traitors cannot cause the loyal
generals to adopt a bad plan
– Decide() in this case is a majority vote, default action is
“Retreat”
Byzantine Generals
• Use v(i) to denote value sent by ith general
• Traitor could send different values to different
generals, so can’t use v(i) obtained from i directly.
New conditions:
– Any two loyal generals use the same value v(i),
regardless of whether i is loyal or not
– If the ith general is loyal, then the value that she sends
must be used by every loyal general as the value of
v(i).

• Re-phrase original problem as reliable broadcast:
– General must send an order (“Use v as my value”) to
lieutenants
– Each process takes a turn as General, sending its
value to the others as lieutenants
– After all values are reliably exchanged, Decide()
Synchronous Byzantine Model
Theorem: There is no algorithm to solve
consensus if only oral messages are used,
unless more than two thirds of the generals are
loyal.
• In other words, impossible if n ≤ 3f for n
processes, f of which are faulty
• Oral messages are under control of the sender
– sender can alter a message that it received before
forwarding it

• Let’s look at examples for special case of n=3,
f=1
Case 1
• Traitor lieutenant tries to foil consensus by
refusing to participate “white hats” == loyal or “good guys”

“black hats” == traitor or “bad guys”

Round 1: Commanding
General sends “Retreat”

Commanding General 1

Round 2: L3 sends “Retreat”
to L2, but L2 sends nothing
Decide: L3 decides “Retreat”

Lieutenant 2

CSC469

R

Loyal lieutenant obeys
commander. (good)

R

Lieutenant 3
R

decides to retreat

11/30/13
Case 2a
• Traitor lieutenant tries to foil consensus by
lying about order sent by general
Round 1: Commanding
General sends “Retreat”

Commanding General 1

Round 2: L3 sends “Retreat”
to L2; L2 sends “Attack” to L3

R

Decide: L3 decides “Retreat”

Loyal lieutenant obeys
commander. (good)

R

Lieutenant 3

Lieutenant 2

R

decides to retreat

A
CSC469

11/30/13
Case 2b
• Traitor lieutenant tries to foil consensus by
lying about order sent by general
Round 1: Commanding
General sends “Attack”

Commanding General 1

Round 2: L3 sends “Attack” to
L2; L2 sends “Retreat” to L3

A

Decide: L3 decides “Retreat”

Loyal lieutenant disobeys
commander. (bad)

A

Lieutenant 3

Lieutenant 2

A

decides to retreat

R
CSC469

11/30/13
Case 3
• Traitor General tries to foil consensus by
sending different orders to loyal lieutenants
Round 1: General sends
“Attack” to L2 and
“Retreat” to L3

Commanding General 1

Round 2: L3 sends “Retreat”
to L2; L2 sends “Attack” to L3

A

Decide: L2 decides “Attack”
and L3 decides “Retreat”

Loyal lieutenants obey
commander. (good)
Decide differently (bad)

R

Lieutenant 3

Lieutenant 2

R

decides to attack

decides to retreat

A
CSC469

11/30/13
Byzantine Consensus: n > 3f
•
•
•

Oral Messages algorithm, OM(f)
Consists of f+1 “phases”
Algorithm OM(0) is the “base case” (no faults)
1) Commander sends value to every lieutenant
2) Each lieutenant uses value received from
commander, or default “retreat” if no value was
received

•

CSC469

Recursive algorithm handles up to f faults

11/30/13

Weitere ähnliche Inhalte

Was ist angesagt?

Fault tolerance review by tsegabrehan zerihun
Fault tolerance review by tsegabrehan zerihunFault tolerance review by tsegabrehan zerihun
Fault tolerance review by tsegabrehan zerihunTsegabrehan Am
 
2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOL2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOLKABILESH RAMAR
 
management of distributed transactions
management of distributed transactionsmanagement of distributed transactions
management of distributed transactionsNilu Desai
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordinationsiva krishna
 
9 fault-tolerance
9 fault-tolerance9 fault-tolerance
9 fault-tolerance4020132038
 
Distributed Mutual Exclusion and Distributed Deadlock Detection
Distributed Mutual Exclusion and Distributed Deadlock DetectionDistributed Mutual Exclusion and Distributed Deadlock Detection
Distributed Mutual Exclusion and Distributed Deadlock DetectionSHIKHA GAUTAM
 
Analysis of mutual exclusion algorithms with the significance and need of ele...
Analysis of mutual exclusion algorithms with the significance and need of ele...Analysis of mutual exclusion algorithms with the significance and need of ele...
Analysis of mutual exclusion algorithms with the significance and need of ele...Govt. P.G. College Dharamshala
 
Byzantine General Problem - Siddharth Chaudhry
Byzantine General Problem - Siddharth ChaudhryByzantine General Problem - Siddharth Chaudhry
Byzantine General Problem - Siddharth ChaudhrySiddharth Chaudhry
 
Transactions (Distributed computing)
Transactions (Distributed computing)Transactions (Distributed computing)
Transactions (Distributed computing)Sri Prasanna
 
Concepts of Data Base Management Systems
Concepts of Data Base Management SystemsConcepts of Data Base Management Systems
Concepts of Data Base Management SystemsDinesh Devireddy
 

Was ist angesagt? (16)

Fault tolerance review by tsegabrehan zerihun
Fault tolerance review by tsegabrehan zerihunFault tolerance review by tsegabrehan zerihun
Fault tolerance review by tsegabrehan zerihun
 
2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOL2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOL
 
management of distributed transactions
management of distributed transactionsmanagement of distributed transactions
management of distributed transactions
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordination
 
Distributed Transaction
Distributed TransactionDistributed Transaction
Distributed Transaction
 
Chapter 13
Chapter 13Chapter 13
Chapter 13
 
9 fault-tolerance
9 fault-tolerance9 fault-tolerance
9 fault-tolerance
 
Distributed Mutual Exclusion and Distributed Deadlock Detection
Distributed Mutual Exclusion and Distributed Deadlock DetectionDistributed Mutual Exclusion and Distributed Deadlock Detection
Distributed Mutual Exclusion and Distributed Deadlock Detection
 
Distributed Mutual exclusion algorithms
Distributed Mutual exclusion algorithmsDistributed Mutual exclusion algorithms
Distributed Mutual exclusion algorithms
 
Basic Paxos Implementation in Orc
Basic Paxos Implementation in OrcBasic Paxos Implementation in Orc
Basic Paxos Implementation in Orc
 
Analysis of mutual exclusion algorithms with the significance and need of ele...
Analysis of mutual exclusion algorithms with the significance and need of ele...Analysis of mutual exclusion algorithms with the significance and need of ele...
Analysis of mutual exclusion algorithms with the significance and need of ele...
 
Byzantine General Problem - Siddharth Chaudhry
Byzantine General Problem - Siddharth ChaudhryByzantine General Problem - Siddharth Chaudhry
Byzantine General Problem - Siddharth Chaudhry
 
Transactions (Distributed computing)
Transactions (Distributed computing)Transactions (Distributed computing)
Transactions (Distributed computing)
 
12EASApril-3412
12EASApril-341212EASApril-3412
12EASApril-3412
 
Concepts of Data Base Management Systems
Concepts of Data Base Management SystemsConcepts of Data Base Management Systems
Concepts of Data Base Management Systems
 
Process Management-Process Migration
Process Management-Process MigrationProcess Management-Process Migration
Process Management-Process Migration
 

Ähnlich wie Fault Tolerance Basics

Synchronization
SynchronizationSynchronization
SynchronizationSara shall
 
Inter process communication
Inter process communicationInter process communication
Inter process communicationGift Kaliza
 
Chapter 11d coordination agreement
Chapter 11d coordination agreementChapter 11d coordination agreement
Chapter 11d coordination agreementAbDul ThaYyal
 
QCon 2014 - Principles of Reliable Communication
QCon 2014 - Principles of Reliable CommunicationQCon 2014 - Principles of Reliable Communication
QCon 2014 - Principles of Reliable CommunicationAndy Piper
 
How to Keep Your Data Safe in MongoDB
How to Keep Your Data Safe in MongoDBHow to Keep Your Data Safe in MongoDB
How to Keep Your Data Safe in MongoDBMongoDB
 
JUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systemsJUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systemsBert Jan Schrijver
 
Real Time Systems
Real Time SystemsReal Time Systems
Real Time SystemsDeepak John
 
Consensus Algorithms: An Introduction & Analysis
Consensus Algorithms: An Introduction & AnalysisConsensus Algorithms: An Introduction & Analysis
Consensus Algorithms: An Introduction & AnalysisZak Cole
 
1. Consensus and agreement algorithms - Introduction.pdf
1. Consensus and agreement algorithms - Introduction.pdf1. Consensus and agreement algorithms - Introduction.pdf
1. Consensus and agreement algorithms - Introduction.pdfAzmiNizar1
 
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdf
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdfDC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdf
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdfLegesseSamuel
 
Storing the real world data
Storing the real world dataStoring the real world data
Storing the real world dataAthira Mukundan
 

Ähnlich wie Fault Tolerance Basics (20)

L14.C3.FA18.ppt
L14.C3.FA18.pptL14.C3.FA18.ppt
L14.C3.FA18.ppt
 
Unit_4_Fault_Tolerance.pptx
Unit_4_Fault_Tolerance.pptxUnit_4_Fault_Tolerance.pptx
Unit_4_Fault_Tolerance.pptx
 
Synchronization
SynchronizationSynchronization
Synchronization
 
Inter process communication
Inter process communicationInter process communication
Inter process communication
 
Chapter 11d coordination agreement
Chapter 11d coordination agreementChapter 11d coordination agreement
Chapter 11d coordination agreement
 
QCon 2014 - Principles of Reliable Communication
QCon 2014 - Principles of Reliable CommunicationQCon 2014 - Principles of Reliable Communication
QCon 2014 - Principles of Reliable Communication
 
How to Keep Your Data Safe in MongoDB
How to Keep Your Data Safe in MongoDBHow to Keep Your Data Safe in MongoDB
How to Keep Your Data Safe in MongoDB
 
JUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systemsJUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systems
 
Real Time Systems
Real Time SystemsReal Time Systems
Real Time Systems
 
Consensus Algorithms: An Introduction & Analysis
Consensus Algorithms: An Introduction & AnalysisConsensus Algorithms: An Introduction & Analysis
Consensus Algorithms: An Introduction & Analysis
 
Software Security and IDS.pptx
Software Security and IDS.pptxSoftware Security and IDS.pptx
Software Security and IDS.pptx
 
Chapter00000000
Chapter00000000Chapter00000000
Chapter00000000
 
운영체제론 Ch16
운영체제론 Ch16운영체제론 Ch16
운영체제론 Ch16
 
1. Consensus and agreement algorithms - Introduction.pdf
1. Consensus and agreement algorithms - Introduction.pdf1. Consensus and agreement algorithms - Introduction.pdf
1. Consensus and agreement algorithms - Introduction.pdf
 
Distributed System
Distributed SystemDistributed System
Distributed System
 
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdf
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdfDC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdf
DC Lecture 04 and 05 Mutual Excution and Election Algorithms.pdf
 
ch 2 - DISTRIBUTED DEADLOCK DETECTION.pptx
ch 2 - DISTRIBUTED DEADLOCK DETECTION.pptxch 2 - DISTRIBUTED DEADLOCK DETECTION.pptx
ch 2 - DISTRIBUTED DEADLOCK DETECTION.pptx
 
Ch5 process synchronization
Ch5   process synchronizationCh5   process synchronization
Ch5 process synchronization
 
Storing the real world data
Storing the real world dataStoring the real world data
Storing the real world data
 
Debugging distributed systems
Debugging distributed systemsDebugging distributed systems
Debugging distributed systems
 

Kürzlich hochgeladen

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Kürzlich hochgeladen (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Fault Tolerance Basics

  • 3. Basic Definitions • Failure: deviation of a system from behaviour described in its specification. • Error: part of the state which is incorrect. • Fault: an error in the internal states of the components of a system or in the design of a system.
  • 4. … and Donald Rumsfeld said: There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.
  • 5. Types of Faults • Hard faults – Permanent Resulting failures are called hard failures • Soft faults – Transient or intermittent – Account for more than 90% of all failures Resulting failures are called soft failures
  • 7. Failure Detection MTBF: Mean Time Between Failure MTTD: Mean Time To Discovery MTTR: Mean Time to Repair
  • 9. Distributed Algorithms • Primary focus in Distributed Systems is on a number of concurrently running processes • Distributed system is composed of n processes • A process executes a sequence of events – Local computation – Sending a message m – Receiving a message m • A distributed algorithm makes use of more than one process.
  • 10. Properties of Distributed Algorithms • Safety – Means that some particular “bad” thing never happens. • Liveness – Indicates that some particular “good” thing will (eventually) happen. Timing/failure assumptions affect how we reason about these properties and what we can prove
  • 11. Timing Model • Specifies assumptions regarding delays between – execution steps of a correct process – send and receipt of a message sent between correct processes • Many gradations. Two of interest are: Synchronous Known bounds on message and execution delays. Asynchronous No assumptions about message and execution delays (except that they are finite). • Partial synchrony is more realistic in distributed system
  • 12. Synchronous timing assumption • Processes share a clock • Timestamps mean something between processes – Otherwise processes are synchronised using a time server • Communication can be guaranteed to occur in some number of clock cycles CSC469 11/30/13
  • 13. Asynchronous timing assumption • Processes operate asynchronously from one another. • No claims can be made about whether another process is running slowly or has failed. • There is no time bound on how long it takes for a message to be delivered.
  • 14. Partial synchrony assumption • “Timing-based distributed algorithms” • Processes have some information about time – Clocks that are synchronized within some bound – Approximate bounds on message-deliver time – Use of timeouts
  • 15. Failure Model • A process that behaves according to its I/O specification throughout its execution is called correct • A process that deviates from its specification is faulty • Many gradations of faulty. Two of interest are: Fail-Stop failures A faulty process halts execution prematurely. Byzantine failures No assumption about behavior of a faulty process.
  • 16. Errors as failure assumptions • Specific types of errors are listed as failure assumptions – Communication link may lose messages – Link may duplicate messages – Link may reorder messages – Process may die and be restarted
  • 17. Fail-Stop failure • A failure results in the process, p, stopping – Also referred to as crash failure – p works correctly until the point of failure • p does not send any more messages • p does not perform actions when messages are sent to it • Other processes can detect that p has failed
  • 18. Fault/failure detectors • A perfect failure detector – No false positives (only reports actual failures). – Eventually reports failures to all processes. • Heartbeat protocols – Assumes partially synchronous environment – Processes send “I’m Alive” messages to all other processes regularly – If process i does not hear from process j in some time T = Tdelivery + Theartbeat then it determines that j has failed – Depends on Tdelivery being known and accurate
  • 19. Other Failure Models • Omission failure • Process fails to send messages, to receive incoming messages, or to handle incoming messages • Timing failure • process‘s response lies outside specified time interval • Response failure • Value of response is incorrect
  • 20. Byzantine failure • Process p fails in an arbitrary manner. • p is modeled as a malevolent entity – Can send the messages and perform the actions that will have the worst impact on other processes – Can collaborate with other “failed” processes • Common constraints – Incomplete knowledge of global state – Limited ability to coordinate with other Byzantine processes
  • 21. Setup of Distributed Consensus • N processes have to agree on a single value. – Example applications of consensus: • Performing a commit in a replicated/distributed database. • Collecting multiple sensor readings and deciding on an action • Each process begins with a value • Each process can irrevocably decide on a value • Up to f < n processes may be faulty – How do you reach consensus if no failures?
  • 22. Properties of Distributed Consensus • Agreement – If any correct process believes that V is the consensus value, then all correct processes believe V is the consensus value. • Validity – If V is the consensus value, then some process proposed V. • Termination – Each process decides some value V. • Agreement and Validity are Safety Properties • Termination is a Liveness property.
  • 23. Synchronous Fail-stop Consensus • FloodSet algorithm run at each process i – Remember, we want to tolerate up to f failures • S is a set of values • Decide(x) can be Si  {initial value} various functions for k = 1 to f+1 send Si to all processes receive Sj from all j != i Si  Si Sj (for all j) end for Decide(Si) • E.g. min(x), max(x), majority(x), or some default • Assumes nodes are connected and links do not fail
  • 24. Analysis of FloodSet • Requires f+1 rounds because process can fail at any time, in particular, during send • Agreement: Since at most f failures, then after f+1 rounds all correct processes will evaluate Decide(Si) the same. • Validity: Decide results in a proposed value (or default value) • Termination: After f+1 rounds the algorithm completes
  • 25. Example with f = 1, Decide() = min() 1 End of round 1 End of round 2 S1 = {0} 2 S2 = {1} {1} {0,1} decide 0 {0,1} {0,1} decide 0 3 S3 = {1}
  • 26. Synchronous/Byzantine Consensus • Faulty processes can behave arbitrarily – May actively try to trick other processes • Algorithm described by Lamport, Shostak, & Pease in terms of Byzantine generals agreeing whether to attack or retreat. Simple requirements: – All loyal generals decide on the same plan of action • Implies that all loyal generals obtain the same information – A small number of traitors cannot cause the loyal generals to adopt a bad plan – Decide() in this case is a majority vote, default action is “Retreat”
  • 27. Byzantine Generals • Use v(i) to denote value sent by ith general • Traitor could send different values to different generals, so can’t use v(i) obtained from i directly. New conditions: – Any two loyal generals use the same value v(i), regardless of whether i is loyal or not – If the ith general is loyal, then the value that she sends must be used by every loyal general as the value of v(i). • Re-phrase original problem as reliable broadcast: – General must send an order (“Use v as my value”) to lieutenants – Each process takes a turn as General, sending its value to the others as lieutenants – After all values are reliably exchanged, Decide()
  • 28. Synchronous Byzantine Model Theorem: There is no algorithm to solve consensus if only oral messages are used, unless more than two thirds of the generals are loyal. • In other words, impossible if n ≤ 3f for n processes, f of which are faulty • Oral messages are under control of the sender – sender can alter a message that it received before forwarding it • Let’s look at examples for special case of n=3, f=1
  • 29. Case 1 • Traitor lieutenant tries to foil consensus by refusing to participate “white hats” == loyal or “good guys” “black hats” == traitor or “bad guys” Round 1: Commanding General sends “Retreat” Commanding General 1 Round 2: L3 sends “Retreat” to L2, but L2 sends nothing Decide: L3 decides “Retreat” Lieutenant 2 CSC469 R Loyal lieutenant obeys commander. (good) R Lieutenant 3 R decides to retreat 11/30/13
  • 30. Case 2a • Traitor lieutenant tries to foil consensus by lying about order sent by general Round 1: Commanding General sends “Retreat” Commanding General 1 Round 2: L3 sends “Retreat” to L2; L2 sends “Attack” to L3 R Decide: L3 decides “Retreat” Loyal lieutenant obeys commander. (good) R Lieutenant 3 Lieutenant 2 R decides to retreat A CSC469 11/30/13
  • 31. Case 2b • Traitor lieutenant tries to foil consensus by lying about order sent by general Round 1: Commanding General sends “Attack” Commanding General 1 Round 2: L3 sends “Attack” to L2; L2 sends “Retreat” to L3 A Decide: L3 decides “Retreat” Loyal lieutenant disobeys commander. (bad) A Lieutenant 3 Lieutenant 2 A decides to retreat R CSC469 11/30/13
  • 32. Case 3 • Traitor General tries to foil consensus by sending different orders to loyal lieutenants Round 1: General sends “Attack” to L2 and “Retreat” to L3 Commanding General 1 Round 2: L3 sends “Retreat” to L2; L2 sends “Attack” to L3 A Decide: L2 decides “Attack” and L3 decides “Retreat” Loyal lieutenants obey commander. (good) Decide differently (bad) R Lieutenant 3 Lieutenant 2 R decides to attack decides to retreat A CSC469 11/30/13
  • 33. Byzantine Consensus: n > 3f • • • Oral Messages algorithm, OM(f) Consists of f+1 “phases” Algorithm OM(0) is the “base case” (no faults) 1) Commander sends value to every lieutenant 2) Each lieutenant uses value received from commander, or default “retreat” if no value was received • CSC469 Recursive algorithm handles up to f faults 11/30/13

Hinweis der Redaktion

  1. &lt;number&gt; Different aspects of the system can exhibit different degrees of synchrony/types of failures -Processor synchrony -Communication synchrony --Ordered messages --Broadcast ability or point-to-point --Atomic receive/send or distinct operations -Link failures can be considered independently from process failures.
  2. &lt;number&gt; These form a hierarchy of failure models from most (failstop) to least (byzantine) benign. Failures up to response failure can be considered relatively benign (can be detected and handled, need not result in corruption). A server that returns an incorrect response, or deviates from the specified behavior internally is hard to detect, as are arbitrary failures, which can lead to serious errors in the distributed system.
  3. &lt;number&gt; e.g. for sensors, robot soccer team – each robot has sensors to detect the environment around it. Team of robots must agree on a strategy (e.g. defense or offense), based on reaching agreement about whether team controls the ball, or other team does. Each robot begins with a value based on its sensors (team_has_ball := I_have_ball). After some consensus alg, each robot decides whether the team has the ball or not. Q: Every process sends its initial value to every other process. Every process then runs same function to decide a value based on those proposed. Assuming no failures, every process starts with the same inputs, computes same function, decides on same value  consensus. Once you allow failures, multiple rounds of communication are needed to reach agreement.
  4. &lt;number&gt; - Validity is hard to define and hard to understand. Lamport, Shostak &amp; Pease don’t use the term “validity” in the Byzantine Generals paper, but instead require that “all correct processes decide on a reasonable value” (their actual phrase is “a small number of traitors cannot cause the loyal generals to adopt a bad plan”, but we haven’t gotten into the Byzantine situation yet, and they sort of sidestep defining what a bad plan would be. Another alternate definition is “If all processes begin with the same input, then any value decided by a correct process must be that input”.
  5. &lt;number&gt; Link failure could cause the perceived failure of multiple nodes (maybe more than f) or could create partitioned sets of nodes, which could each decide the other was failed and come to consensus among themselves, but decide different values.
  6. &lt;number&gt; Requires f+1 rounds because process can fail at any time during the send operation i.e. process i can send Wi to process j, but not process k … then, set of values S at j and k will be different and they could Decide() different values. But, in next round, j and k exchange their complete sets so far, so the differences are resolved.
  7. &lt;number&gt;
  8. &lt;number&gt; Oral messages represent the typical unsigned message delivery in computer systems. This allows a faulty process to lie about the messages that were sent to it in a previous round.
  9. &lt;number&gt; Remember, this is synchronous timing assumption, so L3 knows when to stop waiting for a message from L2.
  10. &lt;number&gt; L3 has no clear majority in this case (1 attack, 1 retreat)… Two possible ways to decide: always obey general in absence of majority, or always decide on safe default (retreat). In this case, they lead to same result, let’s assume our default action in absence of majority is to retreat.
  11. &lt;number&gt; L3 again has no clear majority in this case (1 attack, 1 retreat)… This time, however, our safe default of “retreat” leads to loyal lieutenant disobeying loyal general. Bad. So it looks like our default in absense of majority should be to obey the general. But what if the general can’t be trusted?
  12. &lt;number&gt; Neither L2 nor L3 has a majority in the values they received (both have 1 attack, 1 retreat). Decision to obey commander leads to violation of consensus in this case. But, from the perspective of any loyal lieutenant, this is indistinguishable from Case 2, so other default action (always retreat if no majority) could lead to loyal lieutenant disobeying a loyal general, which also violates the requirements