Difference between Redundancy and Fault Tolerance


Thread Starter


I read the following in an article about the difference between redundancy and fault tolerance:
"The goal of redundancy is to improve the availability of the station by using duplicated equipment. Here the primary is active and the secondary is idle.

However, the goal of fault tolerance is to improve the availability of the station by using duplicated equipment and also to eliminate bad signals going to the field due to hardware failure. Here both the primary and the secondary process logic under normal conditions."

I would like to know for which industries or process applications the fault tolerance concept is a must for main control systems, as compared to the redundancy concept. (I am aware that for ESD systems fault tolerance is the better fit and is implemented in most of them.)

Also, which control system vendors have implemented the fault tolerance concept in their systems or controllers? I have come across many vendors who use the redundancy concept in their controllers.
It is confusing, especially for the non-English-speaking reader. The problem is "how to make a system more reliable." The solution is through the use of fault-tolerance.
There are different approaches to fault-tolerance:
1. make the hardware more reliable and resistant to many faults
2. use duplicate hardware to back-up parts of the system
3. use software and systems approaches to bypass failed system elements
4. all of the above.
Note that redundancy solves SOME problems, but not all. The question becomes one of fault detection and fault recovery. How important is it that every fault is detected and the system provides a correction for that fault? Not every fault must be corrected. For example, measurement faults due to random noise need not be corrected, but only eliminated from causing erroneous controls. Only when noise becomes persistent does any action need to be taken.
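
As a rough illustration of that last point, here is a minimal Python sketch of one way to keep isolated bad readings out of the control value and raise a fault only when they persist. The class name NoiseGuard, the range limits and the persist_limit of 3 samples are all assumptions made for the example, not something taken from a particular product.

```python
class NoiseGuard:
    """Hypothetical helper: hold the last good reading and flag a fault
    only after several consecutive out-of-range samples."""

    def __init__(self, low, high, persist_limit=3):
        self.low = low                      # assumed valid range, lower bound
        self.high = high                    # assumed valid range, upper bound
        self.persist_limit = persist_limit  # assumed number of bad samples in a row
        self.last_good = None
        self.bad_count = 0

    def update(self, raw):
        """Return (value_for_control, fault_active) for one new sample."""
        if self.low <= raw <= self.high:
            self.last_good = raw
            self.bad_count = 0
        else:
            # Bad sample: do not pass it on, just count it.
            self.bad_count += 1
        return self.last_good, self.bad_count >= self.persist_limit


guard = NoiseGuard(low=0.0, high=100.0, persist_limit=3)
for sample in (50.0, 999.0, 51.0, 999.0, 999.0, 999.0):
    print(guard.update(sample))
```

A single spike is simply ignored, while three bad samples in a row set the fault flag so that some recovery action can be taken.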

Many suppliers will tell you that triple modular redundancy (TMR) is the answer to reliability problems. It is an excellent answer because it provides a fault detection solution and a rapid, bumpless recovery method. It suffers from being very expensive, but only you can determine if the extra cost is balanced by the speed of recovery. Many times the TMR approach can be less expensive than using high-cost ruggedized hardware.

Redundancy is often used for automation networks by tradition and because it is not too expensive. TMR is often used for Safety Shutdown Systems in explosive hazard processes because of the danger of loss of human life.

I hope this helps.

Dick Caro
Richard H. Caro, CEO
CMC Associates
2 Beth Circle, Acton, MA 01720
Tel: +1.978.635.9449 Mobile: +1.978.764.4728
Fax: +1.978.246.1270
Web: http://www.CMC.us

Hakan Ozevin

Just to give an idea, simply put: a redundant (H: highly available) system works with "OR" logic. Controller 1 or controller 2 or ... controller n takes control if the others have a problem.

A fail-safe (F) system works with "AND" logic. The process goes on if and only if controller 1 and controller 2 and ... controller n are all operating.
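
To make the "OR" versus "AND" idea concrete, here is a small Python sketch. The function names and the list of per-controller health flags are assumptions made for illustration; real H and F controllers do much more than this.

```python
def h_system_running(controllers_healthy):
    """H (high availability): "OR" logic.
    The process keeps running as long as at least one controller is healthy."""
    return any(controllers_healthy)


def f_system_running(controllers_healthy):
    """F (fail-safe): "AND" logic.
    The process keeps running only if every controller is healthy;
    an empty list is treated as "not running" to stay on the safe side."""
    return bool(controllers_healthy) and all(controllers_healthy)


# Two controllers, the second one has failed.
status = [True, False]
print(h_system_running(status))  # H system carries on with controller 1
print(f_system_running(status))  # F system stops the process
```

With one failed controller the H system carries on, while the F system takes the process to its safe (stopped) state.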

The choice of a safety system depends on the probability of occurrence and the possible damage (minor up to catastrophic). That information should be provided by the process supplier. However, as a rough guide:

In case of a fault, if the process will be in a safe state when the system shuts down, then F systems are used, e.g. boilers, presses, chemical plants.

Otherwise and if shutting down the system causes great losses, H systems are used, e.g. airplanes, synthetic textile industry.

There is also the F/H system, which combines the two methods ("OR" and "AND"), e.g. those used in nuclear plants.

It is also possible to divide a process into H, F, F/H and non-safe parts. For example, for a press, we can use an F system for starting and stopping of the movements (using emergency/two-hand relays of category 4) and leave the rest of the control to an ordinary PLC (non-safe).

You can look at Siemens products where SIMATIC S5-95F, S5-95F/P, S7-300F, S7-400H, S7-400F and S7-400F/H systems exist.

I also recommend looking at tesch.de, where there is a classification/explanation of safety categories.
Some GE turbine control systems are good examples of systems using fault tolerance. A turbine trip is very expensive, and you wouldn't want it to happen due to the failure of 1 field device, a pressure switch for example.

The way GE does fault tolerance is like this:
3 main control processors, each with their own field devices. Let's call them "A", "B", and "C". If a critical pressure switch for one of the systems fails, let's say the "C" bearing oil pressure switch, the system "votes" the 3 inputs. Since "A" shows a good input and "B" does too, the bad "C" input is outvoted, and things continue to run, with the bad "C" input giving an alarm so it can get repaired. The way the system handles analog values is with median value select "voting" such that the middle value of the 3 field devices is chosen as the control value.
This system also incorporates redundancy, since the failure of 1 entire processor, "A", "B", or "C" will also allow operation to continue.
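
The voting described above can be sketched in a few lines of Python. This is only an illustration of two-out-of-three voting and median value selection in general, not GE's actual implementation; the function names and the sample values are made up.

```python
def vote_2oo3(a, b, c):
    """Two-out-of-three vote on three boolean inputs.
    Returns the voted value and the names of the channels that were outvoted,
    so an alarm can be raised for repair."""
    voted = (a and b) or (a and c) or (b and c)
    outvoted = [name for name, value in zip("ABC", (a, b, c)) if value != voted]
    return voted, outvoted


def median_select(a, b, c):
    """Median value select: the middle of the three analog readings
    is used as the control value."""
    return sorted((a, b, c))[1]


# "C" bearing oil pressure switch has failed (reads False), "A" and "B" are good.
print(vote_2oo3(True, True, False))

# One transmitter has drifted far off; the middle value is still sensible.
print(median_select(10.1, 10.3, 55.0))
```

The bad "C" input is outvoted and reported, and a single wild analog reading never becomes the control value because it can only be the highest or the lowest of the three.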