Building Smarter, Faster Fault Handling with Real-Time Data
Even highly automated control loops struggle with fault handling. Safe, effective response requires understanding each alarm’s cause and severity and clearly communicating actionable information to operators.
Fault handling itself is often the source of additional process excursions, because operators are unsure how a given fault should be handled. The two major challenges are siloed knowledge and a lack of standardized fault handling.
Fault Handling Challenges
Unfortunately, the best training programs, well-written standard operating procedures (SOPs), and administrative controls struggle to compete with siloed or tribal knowledge. On-the-job training can sometimes override rules and SOPs, so the fault is handled just as the operator was trained. Different trainers mean different fault handling. A few minutes on the factory floor with operators will confirm this bias.
Fault handling is not just a training problem. In many cases, there is a lack of standardization across machines, systems, and processes, meaning that two similar faults may be handled differently in different cases. Missing standards and inconsistent naming conventions become particularly problematic as systems grow and new OT/IT systems are added over time.

Figure 1. Resetting faults without standardization prevents the ultimate root cause analysis. Image used courtesy of Adobe Stock
Collecting the Right Data: A Prerequisite to Assessing and Handling Faults
The days of collecting data now and analyzing it later are over. Real-time data collection enables real-time decision making, which is essential to modern process control schemes. Therefore, the first places to look for process optimization are the places where data is not collected in real time.
Streaming a lot of real-time data with no organization or structure is only marginally better than not collecting it at all. Adding structure to the incoming data is challenging: each sensor, module, and actuator has its own format, and the more the control system expands, the more complex the data streams become. Implementing a SCADA management platform, such as Ignition SCADA, will simplify and unify the data streams. This includes data contextualization, which attaches timestamps, fault events, equipment metadata, and other important information to each value.
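As a rough illustration of what contextualization means in practice, the sketch below wraps a bare sensor value in a record that carries a timestamp, an equipment path, and a unit. The FaultEvent class, its field names, and the contextualize helper are hypothetical and are not part of Ignition's API; a SCADA platform would normally handle this step.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FaultEvent:
    """One contextualized reading. All field names are illustrative."""
    timestamp: datetime      # when the value was sampled (UTC)
    equipment_path: str      # hypothetical path, e.g. "Plant1/HeatTreat/Oven3"
    signal: str              # name of the process variable
    value: float
    unit: str


def contextualize(raw_value, signal, equipment_path, unit, timestamp=None):
    """Wrap a bare sensor value with the metadata needed to interpret it."""
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)
    return FaultEvent(timestamp, equipment_path, signal, float(raw_value), unit)


event = contextualize(912.4, "temperature", "Plant1/HeatTreat/Oven3", "degC")
print(event.equipment_path, event.signal, event.value, event.unit)
```

With every reading in the same shape, downstream checks can be written once rather than once per device format.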
Ignition provides the data management and contextualization that make fault handling easier and more efficient. From there, there are three steps to fault handling: detection, understanding, and addressing faults.
Step 1: Detecting Faults
There is no way to fix a problem if nobody knows it exists. Fault detection is the obvious first step in fault handling.
The most common method of detecting faults is to put guardrails in place: thresholds that bound the process variable. Whether the goal is maintaining the temperature of a heat treatment oven or enforcing a maximum current for a motor, these thresholds are the first line of defense against safety and quality excursions.
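A threshold check of this kind can be sketched in a few lines. The function name and the numeric limits below are assumptions chosen for illustration, not values from any real process.

```python
from typing import Optional


def check_threshold(value: float,
                    low: Optional[float],
                    high: Optional[float]) -> Optional[str]:
    """Return "LOW" or "HIGH" if value is outside the band, else None.

    Pass None for a limit that does not apply (for example, a motor
    current that only has a maximum).
    """
    if low is not None and value < low:
        return "LOW"
    if high is not None and value > high:
        return "HIGH"
    return None


# Hypothetical limits: an oven held between 890 and 910 degC,
# and a motor with only a maximum current of 12 A.
print(check_threshold(915.0, 890.0, 910.0))  # HIGH
print(check_threshold(9.5, None, 12.0))      # None
```

Values exactly on a limit are treated as in range here; whether the boundary itself should alarm is a design decision for the process owner.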
Beyond thresholding, there are other predictive indicators and KPIs that can signal the onset of larger problems to come. This data can be analyzed using a number of statistical tools to develop preventative maintenance schedules and other corrective actions.
With so many possible error signals in a large system, it is important to develop a prioritization system for fault handling. For example, an overcurrent condition in a motor could be a higher priority than a slight temperature deviation. To prioritize properly, a failure mode and effects analysis (FMEA) should be performed to score the likelihood and severity of each fault condition. This is a systematic way of evaluating the possible real-world consequences of a fault and assigning each fault a ranking to ensure that the worst-case scenario is handled first. The FMEA will use current real-time data, historical norms, and past excursion conditions to assign a priority to each fault condition.
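One simple way to turn likelihood and severity ratings into a ranking is to multiply them, as sketched below. The 1 to 10 scales, the example ratings, and the tie-breaking rule are assumptions for illustration; many FMEA worksheets also include a detection rating, which is omitted here.

```python
from dataclasses import dataclass


@dataclass
class FaultMode:
    name: str
    severity: int     # assumed scale: 1 (negligible) to 10 (hazardous)
    likelihood: int   # assumed scale: 1 (rare) to 10 (almost certain)

    @property
    def score(self) -> int:
        # Simple two-factor priority score; higher means handle sooner.
        return self.severity * self.likelihood


def prioritize(faults):
    """Sort highest score first; ties go to the more severe fault."""
    return sorted(faults, key=lambda f: (f.score, f.severity), reverse=True)


# Example ratings are invented for illustration only.
faults = [
    FaultMode("oven temperature drift", severity=3, likelihood=6),
    FaultMode("motor overcurrent", severity=9, likelihood=4),
]
for f in prioritize(faults):
    print(f.name, f.score)
```

In this example the motor overcurrent (score 36) ranks above the temperature drift (score 18), matching the intuition in the text.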

Figure 2. Preventing future faults often involves looking at motor running conditions. Image used courtesy of Adobe Stock
Step 2: Understanding Faults
A key part of fault handling is understanding why the fault is occurring in the first place. It is easy to give a surface-level diagnosis of the problem, but a system like Ignition can help engineers perform a more comprehensive root cause analysis (RCA). RCA can combine traditional strategies like the 5 Whys and fishbone diagrams with real-time data to look for trends and correlations with operators, shifts, and circumstances that would be difficult to spot at a glance.
Going one step further, RCA can help prioritize risks based on the severity and occurrence of faults. It can reduce alarm flooding, in which too many messages arrive at the same time, causing an operator to overlook high-priority faults.
Step 3: Addressing Faults
Once there is a clear understanding of how and why the fault occurred, the next step is to develop a set of action items to address the fault. There must be a clear way to assess the effectiveness of those actions, rather than simply acknowledging the warning message.
One of the most concerning fault handling problems is dealing with nuisance alarms. If one fault appears many times, the operator may fall into the habit of clearing the alarm without addressing the problem. Aside from the primary issue of ignoring the real root cause, this can cause other alarms to be skipped accidentally and can create a culture where warnings (including safety warnings) are ignored.
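Repeat offenders can be found by counting how often each alarm fires within a sliding time window. The sketch below is one possible approach; the function name, the default limit of three activations, and the ten-minute window are assumptions, not recommended settings.

```python
from collections import defaultdict, deque


def find_nuisance_alarms(events, max_count=3, window_s=600.0):
    """Flag alarms that fire more than max_count times within window_s.

    events: iterable of (timestamp_seconds, alarm_id), sorted by time.
    Returns the set of alarm ids that exceeded the limit at least once.
    """
    recent = defaultdict(deque)   # alarm_id -> recent activation times
    flagged = set()
    for ts, alarm_id in events:
        q = recent[alarm_id]
        q.append(ts)
        # Drop activations that have fallen out of the window.
        while ts - q[0] > window_s:
            q.popleft()
        if len(q) > max_count:
            flagged.add(alarm_id)
    return flagged


# Invented event log: alarm "A" fires four times in three minutes.
log = [(0, "A"), (60, "A"), (120, "A"), (180, "A"), (200, "B"), (2000, "B")]
print(find_nuisance_alarms(log))
```

A flagged alarm is a candidate for review (deadband, delay, or a real repair), not something to suppress automatically.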
Classifying alarms according to ISA-95 standards can reduce operator response times. Each alarm then carries its fault location (enterprise, area, machine, etc.) and category (safety, quality, downtime), with data organized in a hierarchy that simplifies diagnosis.
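The location-and-category classification described above could be represented as a slash-separated path plus a category label, as in the sketch below. The five level names, the path format, and the parse_alarm_source helper are assumptions made for illustration; a real site would use its own equipment model.

```python
from enum import Enum

# Assumed level names for this sketch; adapt to the site's equipment model.
LEVELS = ("enterprise", "site", "area", "line", "machine")


class Category(Enum):
    SAFETY = "safety"
    QUALITY = "quality"
    DOWNTIME = "downtime"


def parse_alarm_source(path: str, category: str) -> dict:
    """Split a path such as "Acme/Dallas/HeatTreat/Line2/Oven3" into
    named hierarchy levels and attach a validated category."""
    parts = path.strip("/").split("/")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(parts)}")
    location = dict(zip(LEVELS, parts))
    # Category(...) raises ValueError for an unknown category name.
    location["category"] = Category(category.lower())
    return location


info = parse_alarm_source("Acme/Dallas/HeatTreat/Line2/Oven3", "Safety")
print(info["area"], info["machine"], info["category"].value)
```

Rejecting malformed paths and unknown categories at this point keeps inconsistent naming from reaching the operator display.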
If operators are given the context for faults, they are much less likely to overlook them. Standardized fault handling, reduction of fault floods, and prevention of nuisance alarms are vital to helping operators make proper control decisions.
Post-Fault Interactions and Continuous Improvement
A rookie mistake would be to simply address a fault and continue the process without building the fault interaction into a continuous improvement loop. KPIs, such as operator response time, alarm rates, mean time to repair (MTTR), and mean time between failures (MTBF), should all be recorded and analyzed for patterns.
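Two of these KPIs can be computed directly from a log of downtime intervals. The sketch below uses the common definitions (MTTR as total repair time divided by the number of failures, MTBF as total uptime divided by the number of failures); the function name and the example intervals are invented for illustration.

```python
def mttr_mtbf(failures, period_hours):
    """Compute (MTTR, MTBF) in hours.

    failures: list of (start_hour, end_hour) downtime intervals that fall
    inside an observation period lasting period_hours.
    """
    if not failures:
        raise ValueError("at least one failure is needed to compute MTTR/MTBF")
    downtime = sum(end - start for start, end in failures)
    n = len(failures)
    mttr = downtime / n                      # average time to repair
    mtbf = (period_hours - downtime) / n     # average uptime per failure
    return mttr, mtbf


# Invented example: three outages during a 168-hour week.
print(mttr_mtbf([(10, 12), (50, 51), (120, 123)], 168))  # (2.0, 54.0)
```

Tracking these numbers per machine over successive periods is what makes trends visible.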
Leveraging machine learning (ML) and advanced analytics on these KPIs can help engineers develop predictive maintenance models that increase machine uptime. They can help identify weak components and bottlenecks in the process and ensure that appropriate spare parts are quickly accessible.
This continuous improvement cycle is also enhanced through the use of shared dashboards. Operators, engineers, and plant managers can see the relevant information at a glance. This allows them to collaborate more easily and take corrective action immediately.
For More Information
Understanding, characterizing, and acting upon faults is essential to remaining competitive in a rapidly evolving marketplace. Partnering with Inductive Automation lets you draw on years of experience across multiple industries to streamline this important task. Reach out to the technical staff at Inductive Automation and learn how they can help you improve fault handling at your facility.
