Basic fault tolerant software techniques

Fault tolerance is a property of software systems that allows them to continue functioning even in the event of failures or errors. The following are some basic techniques used to improve the fault tolerance of software systems:

  1. Redundancy: This involves duplicating critical components of the software system, so that if one component fails, the others can take over and keep the system running. This can include using redundant hardware, such as redundant servers or storage systems, or creating redundant software components.
  2. Checkpointing: This involves periodically saving the state of the software system, so that if a failure occurs, the system can be restored to a previous state. This can be useful in systems that require a lot of processing time, as it allows the system to restart from a saved state if it crashes or fails.
  3. Error detection and correction: This involves detecting errors and correcting them before they cause problems. For example, error detection and correction algorithms can be used to detect and correct errors in data transmission.
  4. Failure prediction: This involves using algorithms or heuristics to predict when a failure is likely to occur, so that the system can take appropriate action to prevent or mitigate the failure.
  5. Load balancing: This involves distributing workloads across multiple components, so that no single component is overburdened. This can help to prevent failures and improve the overall performance of the system.

These are just a few of the basic techniques used to improve the fault tolerance of software systems. In practice, many systems use a combination of these techniques to provide the highest level of fault tolerance possible.
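As an illustration of the checkpointing technique listed above, here is a minimal sketch in Python. The function names (`save_checkpoint`, `load_checkpoint`, `sum_of_squares`), the checkpoint interval, and the `crash_at` switch used to simulate a failure are all assumptions made for this example. A real system would choose its own state format and storage.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the target path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace swaps the file in one step, so a crash during the write
    # never leaves a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def sum_of_squares(n, path, every=100, crash_at=None):
    """Long-running job that resumes from the last checkpoint after a crash."""
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if crash_at is not None and state["i"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["i"] ** 2
        state["i"] += 1
        if state["i"] % every == 0:
            save_checkpoint(path, state)  # periodic save of progress
    return state["total"]
```

If the job crashes, calling `sum_of_squares` again with the same `path` picks up from the most recent checkpoint instead of starting over. Only the work done since that checkpoint is repeated.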

Fault tolerance is the ability of a system, such as a computer or a network, to keep working without interruption when one or more of its components fail.

The main objective of building a fault-tolerant system is to prevent disruptions arising from a single point of failure. This ensures the high availability of applications, including the mission-critical applications a business depends on for continuity. Fault-tolerant systems also use backup components, which automatically take over when a component fails so that there is no loss of service. These backups include power sources, hardware systems, and software systems.

The study of software fault tolerance is relatively new compared with the study of fault-tolerant hardware. In general, fault-tolerant approaches can be classified into fault-removal and fault-masking approaches. Fault-removal techniques can be either forward error recovery or backward error recovery. Forward error recovery aims to identify the error and, based on this knowledge, correct the system state containing the error. Exception handling in high-level languages, such as Ada and PL/I, provides a system structure that supports forward recovery. Backward error recovery corrects the system state by restoring the system to a state that existed before the fault manifested. The recovery block (RB) scheme provides such a system structure.

Another commonly used fault-tolerant software technique is error masking. The N-version programming (NVP) scheme uses several independently developed versions of an algorithm. A final voting system is applied to the results of these N versions to generate a correct result.

A fundamental way of improving the reliability of software systems relies on the principle of design diversity, where different versions of the functions are implemented. To prevent software failures caused by unpredicted conditions, different programs (alternative programs) are developed separately, preferably based on different programming logic, algorithms, computer languages, etc. This diversity is normally applied in the form of recovery blocks or N-version programming. Fault-tolerant software assures system reliability by using protective redundancy at the software level. There are two basic techniques for obtaining fault-tolerant software: the RB scheme and NVP. Both schemes are based on software redundancy, assuming that coincidental software failures are rare.

1. Recovery Block Scheme – The recovery block scheme consists of three elements: primary module, acceptance tests, and alternate modules for a given task. The simplest scheme of the recovery block is as follows:

Ensure T
  By P
  Else by Q1
  Else by Q2
  ...
  Else by Qn-1
  Else Error

Here T is an acceptance test condition that is expected to be met by successful execution of either the primary module P or the alternate modules Q1, Q2, . . ., Qn-1. The process begins when the output of the primary module is tested for acceptability. If the acceptance test determines that the output of the primary module is not acceptable, the system is rolled back to the state it was in before the primary module was executed, and the second module, Q1, is allowed to execute. The acceptance test is repeated to check the successful execution of module Q1. If it fails, then module Q2 is executed, and so on. The alternate modules are identified by the keywords “else by”. When all alternate modules are exhausted, the recovery block itself is considered to have failed, and the final keywords “else error” declare this fact. In other words, when all modules have executed and none has produced an acceptable output, the system fails. A reliability optimization model has been studied by Pham (1989b) to determine the optimal number of modules in a recovery block scheme that minimizes the total system cost given the reliability of the individual modules.

In a recovery block, a programming function is realized by n alternative programs, P1, P2, . . ., Pn. The computational result generated by each alternative program is checked by an acceptance test, T. If the result is rejected, another alternative program is then executed. This is repeated until an acceptable result is generated by one of the n alternatives or until all the alternative programs fail. The probability of failure of the RB scheme, $P_{rb}$, is as follows:

    $$ P_{rb}= \prod_{i=1}^n (e_i+t_{2i})+\sum_{i=1}^n t_{1i}e_i\left ( \prod_{j=1}^{i-1} (e_j+t_{2j}) \right) $$

where $e_i$ is the probability of failure for version $P_i$, $t_{1i}$ is the probability that acceptance test $i$ judges an incorrect result as correct, and $t_{2i}$ is the probability that acceptance test $i$ judges a correct result as incorrect. The first term of the above equation corresponds to the case when all versions fail the acceptance test. The second term corresponds to the probability that acceptance test $i$ judges an incorrect result as correct at the $i$th trial of the n versions.
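The control flow of a recovery block can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper `recovery_block`, the exception `RecoveryBlockError`, and the sorting modules are hypothetical names invented for the example. Rollback is modelled by giving each module a fresh deep copy of the saved input state.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when the primary and every alternate fail the acceptance test."""


def recovery_block(state, acceptance_test, modules):
    """Try each module in order and return the first accepted result.

    modules[0] plays the role of the primary P; the rest are the alternates.
    """
    checkpoint = copy.deepcopy(state)        # recovery point taken on entry
    for module in modules:
        trial = copy.deepcopy(checkpoint)    # roll back before every attempt
        try:
            result = module(trial)
        except Exception:
            continue                         # a crash counts as a failed attempt
        if acceptance_test(checkpoint, result):
            return result
    raise RecoveryBlockError("all modules failed the acceptance test")


def buggy_sort(items):
    """Hypothetical faulty primary: sorts in place but drops the last element."""
    items.sort()
    return items[:-1]


def insertion_sort(items):
    """Hypothetical alternate written with a different algorithm."""
    out = []
    for x in items:
        pos = len(out)
        while pos > 0 and out[pos - 1] > x:
            pos -= 1
        out.insert(pos, x)
    return out


def acceptable(original, result):
    """Cheap acceptance test: ordered, same length, same sum as the input."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and len(result) == len(original) and sum(result) == sum(original)
```

Calling `recovery_block([3, 1, 2], acceptable, [buggy_sort, insertion_sort])` rejects the output of the primary, rolls back, and returns the result of the alternate. The acceptance test here is deliberately cheaper than a full correctness check. That is why the probabilities $t_{1i}$ and $t_{2i}$ appear in the formula above.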

2. N-version Programming – NVP is used for providing fault tolerance in software. In concept, the NVP scheme is similar to the N-modular redundancy scheme used to provide tolerance against hardware faults. NVP is defined as the independent generation of $N \geq 2$ functionally equivalent programs, called versions, from the same initial specification. Independent generation of programs means that the programming efforts are carried out by N individuals or groups that do not interact with respect to the programming process. Whenever possible, different algorithms, techniques, programming languages, environments, and tools are used in each effort. In this technique, N program versions are executed in parallel on identical input, and the results are obtained by voting on the outputs from the individual programs. The advantage of NVP is that when a version failure occurs, no additional time is required for reconfiguring the system and redoing the computation.

Consider an NVP scheme consisting of n programs and a voting mechanism, V. As opposed to the RB approach, all n alternative programs are usually executed simultaneously and their results are sent to a decision mechanism, which selects the final result. The decision mechanism is normally a voter when there are more than two versions (or, more than k versions, in general), and it is a comparator when there are only two versions (k versions). The syntactic structure of NVP is as follows:

  P1 (version 1)
  P2 (version 2)
  ...
  Pn (version n)
  decision V

Assume that a correct result is expected whenever at least two versions produce correct results. The probability of failure of the NVP scheme, $p_{nv}$, can be expressed as

    $$p_{nv}=\prod_{i=1}^n e_i+ \sum_{i=1}^n (1-e_i)e_i^{-1}\prod_{j=1}^n e_j + d$$

The first term of this equation is the probability that all versions fail. The second term is the probability that only one version is correct. The third term, d, is the probability that there are at least two correct results but the decision algorithm fails to deliver the correct result. It is worthwhile to note that the goal of the NVP approach is to ensure that multiple versions will be unlikely to fail on the same inputs. With each version independently developed by a different programming team, using a different design approach, etc., the goal is that the versions will be different enough that they will not fail too often on the same inputs. However, multiversion programming is still a controversial topic.
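A voter for the NVP structure described above might look like the following Python sketch. It is illustrative only: `n_version_execute`, `NoConsensusError`, and the three integer-square-root "versions" are hypothetical names made up for this example. Threads stand in for the separate processors or teams a real deployment would use, and results are assumed to be hashable so that they can be counted.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class NoConsensusError(Exception):
    """Raised when the voter cannot find enough agreeing results."""


def n_version_execute(versions, x, min_agreement=2):
    """Run every version on the same input concurrently and vote on the outputs.

    min_agreement=2 mirrors the assumption that at least two matching
    results are needed before an answer is delivered.
    """
    def run(version):
        try:
            return True, version(x)
        except Exception:
            return False, None           # a crashed version casts no vote

    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        outcomes = list(pool.map(run, versions))

    votes = Counter(result for ok, result in outcomes if ok)
    ranked = votes.most_common(2)
    if ranked and ranked[0][1] >= min_agreement:
        # Reject a tie between two different answers rather than guess.
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0]
    raise NoConsensusError(f"no result reached {min_agreement} votes: {dict(votes)}")


def isqrt_library(n):
    """Version 1: standard-library integer square root."""
    return math.isqrt(n)


def isqrt_binary_search(n):
    """Version 2: independent binary-search implementation."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_faulty(n):
    """Version 3: deliberately wrong for most inputs."""
    return n // 2
```

With all three versions, `n_version_execute(versions, 49)` outvotes the faulty version. With only `isqrt_library` and `isqrt_faulty`, the decision mechanism degenerates into a comparator and raises `NoConsensusError` when the two disagree.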

The main difference between the recovery block scheme and N-version programming is that the modules are executed sequentially in the former, whereas the versions in the latter are usually executed simultaneously. Because of this sequential execution, the recovery block scheme is generally not applicable to critical systems where real-time response is of great concern.
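To compare the two schemes numerically, the failure probabilities defined above can be evaluated with a short Python sketch. The function names are assumptions made for this example. The NVP function computes the "exactly one version is correct" term as a sum over versions, matching the description of the second term. It multiplies the other versions' failure probabilities directly instead of dividing by $e_i$, which gives the same value and avoids division by zero.

```python
from math import prod


def rb_failure_probability(e, t1, t2):
    """Failure probability of a recovery block with n alternatives.

    e[i]  : probability that version i fails
    t1[i] : probability that test i accepts an incorrect result
    t2[i] : probability that test i rejects a correct result
    """
    n = len(e)
    # First term: every version is rejected by the acceptance test.
    all_rejected = prod(e[i] + t2[i] for i in range(n))
    # Second term: versions before i are rejected, then an incorrect
    # result from version i is wrongly accepted.
    wrong_accepted = sum(
        t1[i] * e[i] * prod(e[j] + t2[j] for j in range(i))
        for i in range(n)
    )
    return all_rejected + wrong_accepted


def nvp_failure_probability(e, d):
    """Failure probability of NVP when at least two correct results are needed.

    e[i] : probability that version i fails
    d    : probability that the decision algorithm fails despite at least
           two correct results
    """
    n = len(e)
    all_fail = prod(e)
    exactly_one_correct = sum(
        (1 - e[i]) * prod(e[j] for j in range(n) if j != i)
        for i in range(n)
    )
    return all_fail + exactly_one_correct + d
```

For instance, the two functions can be called with the same list `e` of version failure probabilities to see how the acceptance-test error rates `t1` and `t2` in one scheme trade off against the voter failure probability `d` in the other.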

Advantages and Disadvantages:

Advantages of using fault-tolerant techniques in software systems:

  1. Improved reliability: Fault-tolerant techniques help to ensure that software systems continue to function even in the event of failures or errors, improving the overall reliability of the system.
  2. Increased availability: By preventing failures and downtime, fault tolerance techniques help to increase the overall availability of the system, leading to increased user satisfaction and adoption.
  3. Reduced downtime: By preventing failures and mitigating the impact of errors, fault tolerance techniques help to reduce the amount of downtime experienced by the software system, leading to increased productivity and efficiency.
  4. Improved performance: By distributing workloads across multiple components and preventing overburdening of any single component, fault tolerance techniques can help to improve the overall performance of the software system.

Disadvantages of using fault-tolerant techniques in software systems:

  1. Increased complexity: Implementing fault tolerance techniques can add complexity to the software system, making it more difficult to develop, maintain, and test.
  2. Increased cost: Implementing fault tolerance techniques can be expensive, requiring specialized hardware, software, and expertise.
  3. Reduced performance: In some cases, implementing fault tolerance techniques can lead to reduced performance, as the system must devote resources to error detection, correction, and recovery.
  4. Overhead: The process of detecting and recovering from failures can introduce overhead into the software system, reducing its overall performance.
  5. False alarms: In some cases, fault-tolerant techniques may detect errors or failures that are not actually present, leading to false alarms and unnecessary downtime.

Last Updated : 06 Feb, 2023