Technical Article

A Software Approach to Calculating Mean Time to Repair (MTTR) and Mean Time Between Failure (MTBF)

December 07, 2021 by Shawn Dietrich

MTTR and MTBF are metrics utilized to determine equipment downtime. How do you calculate these metrics, and how can a control engineer use them for machine reliability?

What is Mean Time to Recovery (MTTR)?

In automation software, MTTR (mean time to recovery) is a metric used to determine how easily equipment can be diagnosed and restored to automatic operation. MTTR is the average time taken to restore a machine to a productive state and provides a good baseline for machine efficiency. When used with other indicators, it could point controls engineers and maintenance staff to possible nuisance faults that can be reduced.

Figure 1. Industrial machinery.

Machine Failures

Every machine has faults or alarms that stop the machine and notify an operator or maintenance staff. Sometimes these faults are easily cleared, and sometimes they direct maintenance staff to a component that is not responding the way it should. MTTR calculations can help a controls software developer make fault descriptions clear and easy to interpret.

Calculating MTTR

Engineers can calculate MTTR manually or, using basic logic, program the calculation into the machine controller so that trained technicians can review the values anytime the machine is in production. The formula is as follows.

MTTR = sum of the time taken to recover from each fault / number of faults

For example, say a machine faults 10 times throughout a shift and the downtime related to those faults totals 25 minutes. Then MTTR = 25 min / 10 = 2.5 min. Depending on the fault, this could mean the fault message is not clear enough for an operator to react quickly.
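The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than controller code: the function name and the individual per-fault downtimes are made up for the example, chosen only so that they add up to the 10 faults and 25 minutes used in the text.

```python
def mean_time_to_recovery(downtimes_min):
    """Return MTTR in minutes: total recovery time divided by the fault count."""
    if not downtimes_min:
        raise ValueError("no faults recorded, MTTR is undefined")
    return sum(downtimes_min) / len(downtimes_min)


# Hypothetical downtime per fault (minutes) for one shift: 10 faults, 25 min total.
shift_downtimes = [1.0, 4.0, 2.5, 3.0, 1.5, 2.0, 5.0, 0.5, 3.5, 2.0]

mttr = mean_time_to_recovery(shift_downtimes)
print(f"MTTR = {mttr:.1f} min")  # prints "MTTR = 2.5 min"
```

Keeping the individual downtimes, rather than only a running total, also makes it possible to spot a single long outage that is skewing the average.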

Using MTTR

Once the MTTR has been calculated, software developers and controls engineers can use it to judge how clear and descriptive the fault and alarm messages are. The MTTR also reflects how quickly the operator and maintenance staff address faults.

Figure 2. Monitoring industry machinery and noticing spikes due to failures. Image used courtesy of Chris Liverani

To remove some of the error introduced by operators not being at the machine or maintenance staff not clearing faults, we can change what is timed so that only the time it takes an operator to acknowledge the fault is recorded. Dividing the total time taken to acknowledge faults by the number of faults shows how long it takes an operator to get to the machine, read the fault, and then act. This can be useful information when determining why overall equipment effectiveness (OEE) is so low or why the downtime is so high.
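One way to picture the difference between the two measurements is to record three timestamps per fault and average two different intervals. The sketch below assumes a hypothetical FaultEvent record with raised, acknowledged, and cleared times. These field names and the sample timestamps are illustrative assumptions, not something a particular controller or HMI provides.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FaultEvent:
    """Hypothetical record of one fault; the field names are assumptions."""
    raised: datetime        # fault becomes active and the machine stops
    acknowledged: datetime  # operator acknowledges the fault
    cleared: datetime       # machine is back in automatic operation


def mean_time_to_acknowledge(events):
    """Average minutes from fault raised to operator acknowledgement."""
    total_s = sum((e.acknowledged - e.raised).total_seconds() for e in events)
    return total_s / len(events) / 60


def mean_time_to_recover(events):
    """Average minutes from fault raised to machine restored (MTTR)."""
    total_s = sum((e.cleared - e.raised).total_seconds() for e in events)
    return total_s / len(events) / 60


# Two made-up faults from a single shift.
events = [
    FaultEvent(datetime(2021, 12, 7, 8, 0), datetime(2021, 12, 7, 8, 1),
               datetime(2021, 12, 7, 8, 5)),
    FaultEvent(datetime(2021, 12, 7, 9, 0), datetime(2021, 12, 7, 9, 3),
               datetime(2021, 12, 7, 9, 7)),
]

print(f"Mean time to acknowledge: {mean_time_to_acknowledge(events):.1f} min")
print(f"Mean time to recover:     {mean_time_to_recover(events):.1f} min")
```

With this sample data, a third of the recovery time passes before anyone acknowledges the fault, which points toward operator response rather than the repair itself.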

What is Mean Time Between Failure (MTBF)?

MTBF is the mean time between failures; it is the average time between one system fault and the next. Sometimes, as controls software designers, we can’t do much about this because some faults are mechanical or device-related and will continue to occur until the device or mechanical interference is corrected. However, if we use MTBF for operator-related faults, we can determine how often an operator makes a mistake and causes a loss in production.

Operator Faults

An operator fault can occur anytime an operator needs to interact with the automated equipment so that it can continue its task. This could be installing a component that is too difficult to automate, or loading the equipment with parts to be assembled.

Wherever an operator is required to interact with automated equipment, there should be sensors or vision systems to check that the operator has completed the task correctly. These systems raise a fault when they detect that the operator has performed the task incorrectly. These faults are valid and will slow the production rate of the machine if they happen too frequently.

Calculating MTBF

Calculating MTBF is very similar to calculating MTTR, except we are looking at specific faults. The formula is relatively simple.

MTBF = sum of the time between faults / number of faults

However, this calculation is only useful when looking at very specific faults, and it need not be limited to operator faults. In addition, if you are receiving recurring faults and want to know the impact on your recovery, the MTBF can be subtracted from other metrics.
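A per-fault MTBF can be sketched by filtering a fault log down to one fault code and averaging the gaps between its occurrences. The log layout, the fault codes, and the function name below are hypothetical. The sketch also makes two simplifying assumptions: the first gap is measured from the start of the observation window, so that the number of gaps equals the number of faults as in the formula, and time spent recovering from a fault is not subtracted from the gaps.

```python
def mean_time_between_faults(fault_log, fault_code, window_start_min=0.0):
    """Average gap (minutes) between occurrences of one specific fault.

    fault_log is a list of (minutes_since_window_start, fault_code) tuples.
    Returns None if the fault never occurred in the window.
    """
    times = sorted(t for t, code in fault_log if code == fault_code)
    if not times:
        return None

    gaps = []
    previous = window_start_min
    for t in times:
        gaps.append(t - previous)  # time since the previous occurrence
        previous = t
    return sum(gaps) / len(gaps)


# Made-up log for one shift; the fault codes are illustrative only.
fault_log = [
    (30.0, "PART_NOT_SEATED"),
    (45.0, "GUARD_OPEN"),
    (90.0, "PART_NOT_SEATED"),
    (240.0, "PART_NOT_SEATED"),
    (300.0, "GUARD_OPEN"),
]

for code in ("PART_NOT_SEATED", "GUARD_OPEN"):
    mtbf = mean_time_between_faults(fault_log, code)
    print(f"{code}: MTBF = {mtbf:.0f} min")
```

In this sample the operator-related PART_NOT_SEATED fault recurs roughly every 80 minutes, almost twice as often as GUARD_OPEN, which is the kind of comparison that filtering by fault makes possible.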

Figure 3. MTTR and MTBF can help prevent issues on the production line, which can slow efficiency and increase costs.

Combining Metrics for Machine Reliability

Machine owners or manufacturing engineers will typically use the OEE of a machine to determine how well the machine is making parts, but other metrics can drill down into the root of the problem. By using MTTR or MTBF and adjusting which faults you are looking at, you can determine which fault is causing the highest OEE reduction and longest downtime.
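As a rough illustration of drilling down by fault, the sketch below groups downtime records by fault code and ranks the codes by total downtime, reporting a per-fault MTTR alongside. The record layout, fault codes, and function name are assumptions made for the example. It ranks by downtime only and does not compute OEE itself.

```python
from collections import defaultdict


def rank_faults_by_downtime(records):
    """Summarize (fault_code, downtime_min) records, worst offender first."""
    totals = defaultdict(lambda: [0, 0.0])  # code -> [count, total downtime]
    for code, downtime_min in records:
        totals[code][0] += 1
        totals[code][1] += downtime_min

    summary = [
        {
            "code": code,
            "count": count,
            "downtime_min": total,
            "mttr_min": total / count,  # per-fault MTTR
        }
        for code, (count, total) in totals.items()
    ]
    return sorted(summary, key=lambda row: row["downtime_min"], reverse=True)


# Made-up downtime records for one shift.
records = [
    ("GUARD_OPEN", 1.0),
    ("PART_NOT_SEATED", 4.0),
    ("GUARD_OPEN", 2.0),
    ("VISION_REJECT", 0.5),
    ("PART_NOT_SEATED", 6.0),
]

for row in rank_faults_by_downtime(records):
    print(f"{row['code']:<16} faults={row['count']} "
          f"downtime={row['downtime_min']:.1f} min MTTR={row['mttr_min']:.1f} min")
```

Sorting by total downtime rather than by fault count keeps a rare but slow-to-clear fault from being hidden behind frequent nuisance faults.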

As controls software designers, we can use MTTR to develop cleaner, more robust faults that can be easily understood by operators or maintenance staff. We can use MTBF to provide a more accurate picture of the machine OEE. By combining these metrics, we develop a clearer, more accurate snapshot of machine efficiency, which is important not only to the equipment manufacturer but also to the customer and production staff.