FAULT TOLERANT COMPUTER SYSTEM DESIGN A. Jayaprakash, M.Sc., M.Net., D.DiSP., [email_address] / 919730984888
What is fault tolerance? Fault tolerance, or graceful degradation, is the property that enables a system (often computer-based) to continue operating properly when one or more of its components fail or develop faults. Fault tolerance is particularly sought after in high-availability and life-critical systems. As a real-world example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication over a packet-switched network, even when the communication links are imperfect or overloaded.
Fault-tolerant systems are also characterized in terms of both planned and unplanned service outages. These are usually measured at the application level, not just at the hardware level. The figure of merit is called availability and is expressed as a percentage; a "five nines" system, for example, would statistically provide 99.999% availability.
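To make the availability percentage concrete, the following minimal sketch converts an availability figure into the downtime it allows per year. The function name and the choice of a 365.25-day year are assumptions made for illustration; they do not come from the slides.

```python
# Yearly downtime budget implied by an availability percentage.
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime, in minutes per year, allowed by the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
        print(f"{label} ({pct}%): {downtime_minutes_per_year(pct):.2f} minutes/year")
```

Under these assumptions, a five nines system may be unavailable for a little over five minutes per year, counting planned and unplanned outages together.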
How fault tolerance meets data storage in our life: Spare components address the first fundamental characteristic of fault tolerance in three ways. Replication: providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum. Redundancy: providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (failover). Diversity: providing multiple different implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation. Real-world example: all implementations of RAID (redundant array of independent disks) except RAID 0 are fault-tolerant storage devices that use data redundancy.
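The replication and diversity strategies described above can be sketched as a simple majority voter. This is only an illustration: the three `square_*` functions are hypothetical stand-ins for independently written implementations of one specification, and a real voter would also have to deal with replicas that crash or time out.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of replicas, or raise if none exists."""
    if not results:
        raise ValueError("no replica results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no quorum: replicas disagree")
    return value


def run_replicated(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Send the same request to every replica and vote on the answers."""
    return majority_vote([replica(x) for replica in replicas])


# Hypothetical diverse implementations of the same specification (square a number).
def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_faulty(x: int) -> int:
    return x * x + 1  # simulated fault in one implementation


if __name__ == "__main__":
    print(run_replicated([square_a, square_b, square_faulty], 7))
```

With three replicas the voter masks one faulty answer; if all three disagree there is no quorum and the sketch raises an error rather than guessing.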
The major design goals: In engineering, fault-tolerant design (sometimes loosely called fail-safe design, although fail-safe strictly means failing into a safe state) is a design that enables a system to continue operation, possibly at a reduced level (graceful degradation), rather than failing completely when some part of the system fails. Components: if each component, in turn, can continue to function when one of its subcomponents fails, the total system can continue to operate as well. Redundancy: backup components automatically "kick in" should one component fail.
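The idea of backup components that "kick in" can be sketched as a failover wrapper that tries each redundant instance in turn. The names `call_with_failover`, `primary`, and `backup` are hypothetical and chosen only for this example.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(instances: Sequence[Callable[[], T]]) -> T:
    """Try each redundant instance in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for instance in instances:
        try:
            return instance()
        except Exception as exc:  # a real system would catch narrower error types
            last_error = exc
    raise RuntimeError("all redundant instances failed") from last_error


def primary() -> str:
    raise ConnectionError("primary is down")  # simulated component failure


def backup() -> str:
    return "served by backup"


if __name__ == "__main__":
    print(call_with_failover([primary, backup]))
```

The caller sees a normal result as long as at least one instance works, which is the behavior the redundancy goal describes; only when every instance fails does the failure become visible.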
When to use: Making every component fault-tolerant is normally not an option. In such cases the following criteria may be used to decide which components should be fault-tolerant. How critical is the component? In a car, the radio is not critical, so it has less need for fault tolerance. How likely is the component to fail? Some components, like the drive shaft in a car, are unlikely to fail, so no fault tolerance is needed. How expensive is it to make the component fault-tolerant? Requiring a redundant car engine, for example, would likely be too expensive, both economically and in terms of weight and space, to be considered.
Disadvantages: Interference with fault detection in the same component. Interference with fault detection in another component. Reduction of priority of fault correction. Test difficulty. Cost. Inferior components. Related terms: There is a difference between fault tolerance and systems that rarely have problems. For instance, the Western Electric crossbar systems had failure rates of two hours per forty years, and were therefore highly fault resistant. But when a fault did occur they still stopped operating completely, and were therefore not fault-tolerant.
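As a rough worked example, the crossbar figure can be expressed as an availability. This assumes that "two hours per forty years" means two hours of total downtime over forty years of service, and that a year averages 8766 hours; both are interpretations, not statements from the slides.

```latex
A \;=\; 1 - \frac{2\ \text{h}}{40 \times 8766\ \text{h}}
  \;=\; 1 - \frac{2}{350\,640}
  \;\approx\; 0.9999943
  \;\approx\; 99.9994\,\%
```

Under that reading the system exceeds five nines of availability, yet it is still not fault-tolerant, because a single fault stops it completely.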
Fault-tolerant computer systems: Fault-tolerant computer systems are systems designed around the concepts of fault tolerance. In essence, they must keep working at a satisfactory level in the presence of faults. [Figure: basic block diagram]
Types of fault tolerance: Most fault-tolerant computer systems are designed to handle several possible failures, such as hard disk failures, input or output device failures, or other temporary or permanent failures; software bugs and errors; interface errors between the hardware and software, including driver failures; operator errors, such as erroneous keystrokes, bad command sequences, or installing unexpected software; and physical damage or other flaws introduced to the system from an outside source. Hardware fault tolerance is the most common application of these systems and is designed to prevent failures caused by hardware components. Typically, components have multiple backups and are separated into smaller "segments" that act to contain a fault, and extra redundancy is built into all physical connectors, power supplies, fans, and so on. The figure above shows the construction of such a fault-tolerant system, in which every unit is duplicated.
Evolution of fault-tolerant computer systems: The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by Antonin Svoboda; its basic design was magnetic drums connected via relays, with a voting method of memory error detection. Most of the development in so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s, in preparation for Project Apollo and other research. IBM developed the first computer of this kind for NASA, for guidance of Saturn V rockets, but later on BNSF, Unisys, and General Electric built their own. IBM and Tandem Computers (later acquired by Compaq, which in turn merged with HP) built their entire business on such machines, which used single-point tolerance to create their NonStop systems with uptimes measured in years.
Uses of FTCS: Fault-tolerant systems have a wide range of applications. Nowadays almost every application is treated as mission-critical, and no company wants to compromise on system downtime. To ensure business continuity, vendors try to bring every system they design under a fault-tolerant design. Major industry uses of fault-tolerant systems include the airline industry, research labs, aerospace centers, BFSI (banking, financial services and insurance), ATM transactions, telemedicine, and military services. The list keeps growing with new applications; in each case these systems ensure business continuity and continuous access.
Fault tolerance verification and validation: The most important requirement in the design of a fault-tolerant computer system is making sure it actually meets its reliability requirements. This is done by using various failure models to simulate failures and analyzing how well the system reacts. The most commonly used models are HARP, SAVE, and SHARPE in the USA, and SURF or LASS in Europe.
Fault tolerance research: Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work. The range of topics that touch on this research is very wide: it runs from obvious subjects such as software modeling and reliability or hardware design to arcane ones such as stochastic models, graph theory, formal or exclusionary logic, parallel processing, remote data transmission, and more.
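To illustrate the simulation-based evaluation described under verification and validation, the sketch below injects random faults into a triple-replicated unit with a majority voter and compares the measured reliability with the closed-form value. It is a toy Monte Carlo model, not a reproduction of HARP, SAVE, SHARPE, SURF, or LASS; the triple-replica configuration, the independence of faults, the perfect voter, and all function names are assumptions made for the example.

```python
import random


def tmr_works(p_component: float, rng: random.Random) -> bool:
    """One trial: each of three replicas is healthy with probability p_component."""
    healthy = sum(rng.random() < p_component for _ in range(3))
    return healthy >= 2  # the voter needs at least two healthy replicas


def simulate_tmr_reliability(p_component: float, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate system reliability by repeated random fault injection."""
    rng = random.Random(seed)
    return sum(tmr_works(p_component, rng) for _ in range(trials)) / trials


def analytic_tmr_reliability(p: float) -> float:
    """Closed form for 2-of-3 voting with independent replicas: 3p^2 - 2p^3."""
    return 3 * p**2 - 2 * p**3


if __name__ == "__main__":
    p = 0.9
    print(f"simulated: {simulate_tmr_reliability(p):.4f}")
    print(f"analytic:  {analytic_tmr_reliability(p):.4f}")
```

Agreement between the simulated and analytic figures is a basic sanity check on the failure model before it is applied to configurations that have no simple closed form.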
Thank you
