Disaster Recovery Plan and Strategy in Case of IT/OT Equipment Failure

Failures are never the goal of any industrial operation. But when disaster strikes, it’s always better to have a mitigation plan in place well in advance of the event to ensure timely recovery.

Technical Article April 24, 2023 by Munir Ahmad

Industrial control systems are largely independent computer systems that run a variety of applications and hold business-critical data. To maintain the system's consistency and integrity and to secure that data, backups are taken periodically on the various machines and stored in a safe, redundant environment. In case of disaster, the data can then be restored in a short time using software tools.

Disaster Recovery Planning

The control system (SCADA/DCS) is mostly installed on critical industrial infrastructure, and the failure of such a system, or even of a single node, can pose a serious threat to the normal operation of the process. The consequences could include generation outages, data acquisition and communication failures, production and financial losses, and even health and safety hazards. Therefore, it is imperative to have a disaster recovery plan and restoration methodology ready in advance to minimize downtime.

Figure 1. Disaster recovery for industrial infrastructure can ensure a quick return to operations for critical operations. Image used courtesy of Canva

There are multiple factors that contribute to system disasters, including:

Hardware failure (hard disk, power supply), the most common factor in industrial equipment.
Software issues at the application or service level or operating system issues.
Communication media and network issues due to NIC malfunction or switch/router failure.
Cyberattacks, which are real threats to industrial control systems.
Natural calamities like floods.

In the event of a disaster, the system needs to be restored as soon as possible. The most common issue SCADA engineers confront while restoring the system is that they do not have backups that are recent, valid, tested, and of the correct type.
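One part of keeping backups "valid" can be sketched as a checksum comparison: record a SHA-256 digest when the backup is written and compare it again before relying on the file. This is a minimal sketch, not a method described in the article; the function names `sha256_of` and `backup_is_intact` are hypothetical, and a matching checksum only shows that the file has not changed, so it does not replace a periodic test restore.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(path: Path, expected_sha256: str) -> bool:
    """True if the backup file exists and still matches the digest
    recorded when the backup was taken (hypothetical helper)."""
    return path.is_file() and sha256_of(path) == expected_sha256
```

The digest recorded at backup time would be stored separately from the backup itself, for example on the maintenance check sheet or on a second storage device.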

It is recommended to adopt best practices such as daily and weekly maintenance check sheets and performance monitoring tools that track hardware health: disk and memory utilization, network bandwidth, temperatures, power supply status, and so on. One such tool is Integrated Lights-Out (iLO) on HPE ProLiant servers, a web-based interface that enables remote server management and monitoring. iLO gives consistent insight into the health and operation of the server and can give indications well before a sudden hardware failure.
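A daily check-sheet item such as disk utilization can also be scripted. The sketch below uses only the Python standard library and is independent of iLO or any vendor tool; the 80% and 90% thresholds and the function names are assumptions chosen for illustration, not values from the article.

```python
import shutil


def disk_utilization_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(percent: float, warn: float = 80.0, critical: float = 90.0) -> str:
    """Map a utilization percentage to a check-sheet status.
    The thresholds are illustrative assumptions."""
    if percent >= critical:
        return "CRITICAL"
    if percent >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    pct = disk_utilization_percent(".")
    print(f"disk utilization: {pct:.1f}% -> {classify(pct)}")
```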

Increase the System Availability by Identifying Critical Equipment Types

The first task, before any detailed planning, is to list the critical and sensitive equipment in the system. The identified devices are likely to come from multiple vendors and to provide different services at multiple network levels, from data acquisition at the field device layer up to the main central system. For example, if the plant is producing electricity and generating units are connected to the busbar, monitoring and control via SCADA/DCS is essential. From practical experience, the critical assets in control systems are:

Remote terminal units (RTUs) and programmable logic controllers (PLCs) interfaced with field devices for data acquisition and control.
Servers and workstations, such as Dell and HP servers, running multiple software applications.
Network switches and routers for LAN/WAN connectivity, such as a Cisco 2960 48-port switch.
Real-time front-end HMI applications and database services such as Oracle.
Domain controllers for authentication and authorization services for operators' roles.
NTP clock for time synchronization of IT/OT devices such as RTUs/PLCs, IEDs, computers, and digital relays.

After identifying the primary components of the control system, it is equally important to categorize the equipment: which hardware parts are hot-swappable and which require a system shutdown to replace.
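The inventory and its hot-swap classification can be kept as a small data structure so that it is easy to query during an outage. The entries below are hypothetical examples only: whether a given part is hot-swappable depends on the actual hardware and must be confirmed from the vendor documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    category: str
    hot_swappable: bool  # must be confirmed per device; values here are assumed


# Hypothetical inventory for illustration only.
INVENTORY = [
    Asset("Server hard disk (mirrored)", "server", True),
    Asset("Server power supply (redundant)", "server", True),
    Asset("PLC CPU unit", "controller", False),
    Asset("NTP clock", "time sync", False),
]


def requires_shutdown(assets: list) -> list:
    """Names of assets that cannot be replaced while the system runs."""
    return [a.name for a in assets if not a.hot_swappable]


if __name__ == "__main__":
    print(requires_shutdown(INVENTORY))
```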

Figure 2. Recovery of IT/OT systems involves many networked components: cables, drives, servers, and end devices. Image used courtesy of Canva

Automatic and Manual Backup for IT / OT Assets

Transmission, distribution, and generation systems comprise digital devices, which can be grouped into ‘information technology’ (IT) and ‘operational technology’ (OT) assets. OT is employed in the daily operations of many industrial sectors, including:

Water treatment plants
Oil and gas fields
Transmission and distribution systems
Power generation facilities
Waste management systems
Manufacturing plants

OT deals with the physical world, and typical OT assets combine hardware and software to control plant equipment. Equipment in the OT category includes RTUs, PLCs, DCSs, and SCADA systems.

IT handles information and data, for example to improve efficiency through predictive maintenance. IT assets are servers, real-time applications, and database services. Under certain circumstances, such as hardware faults, the information stored on disk can be destroyed. To ensure that this information is not completely lost, there should be routines for regular backups of the system. The backup can be separated into different activities:

Copying of the total system (system image).
Copying of the database.
Copying of the RTU/PLC programs, network parameters, NTP clock settings, and switch/router running configurations.

The following are the types of data that need backup.

System Image Backup

To ensure speedy recovery from a total system failure, a complete system image backup is executed once a month on all Linux and MS Windows servers and workstations. The preferred method for restoring a system quickly at a production site is to restore it from a disk image. A disk image replicates the original hard disk, including the operating system, proprietary applications, drivers, environment variables, and path settings. Hard disk cloning can be performed with a common imaging tool such as Acronis True Image. It is recommended that the latest system images always be saved on storage media such as an external USB drive or a network-attached storage (NAS) device.

Figure 3. HP Data Protector Manager to configure the full or incremental backup schedules.

Database Backup

The real-time plant data in the different databases changes frequently, so a daily backup is necessary. Daily online database backups are executed on all Linux and MS Windows servers that contain Microsoft SQL Server or Oracle database files.
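The two intervals mentioned in this article (a monthly system image and a daily database backup) can be checked automatically. The sketch below is an assumption about how such a check might look, not a feature of any backup product; the names `MAX_AGE` and `overdue_backups` are hypothetical, and "once a month" is approximated as 31 days.

```python
from datetime import datetime, timedelta

# Maximum acceptable age per backup type. The intervals follow the article;
# treating "monthly" as 31 days is an assumption.
MAX_AGE = {
    "system_image": timedelta(days=31),
    "database": timedelta(days=1),
}


def overdue_backups(last_run: dict, now: datetime) -> list:
    """Return the backup types that are missing or older than allowed."""
    return sorted(
        kind
        for kind, limit in MAX_AGE.items()
        if kind not in last_run or now - last_run[kind] > limit
    )


if __name__ == "__main__":
    now = datetime(2023, 4, 24, 12, 0)
    last_run = {
        "system_image": datetime(2023, 4, 1, 0, 0),
        "database": datetime(2023, 4, 22, 12, 0),
    }
    print(overdue_backups(last_run, now))  # prints ['database']
```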

For automatic backups, HP Data Protector can be used to back up both the databases and the image of the entire system disk. The software takes backups of machines running different operating systems automatically according to a defined schedule and saves or exports them to disk storage or tape media. The following are the main features of Data Protector that cut the cost and effort of the backup process:

Data Protector provides a fully automated backup process.
The backup and restore processes are configured, controlled, and monitored from a single machine, normally called the Central Backup Server. In most systems, this server is the domain controller on both the production system and the emergency backup system.
Backups can be taken from machines running Linux and Windows.
Optional parallel backup.

Configuration Backup

The programs of OT devices such as RTUs and PLCs can be backed up and then restored if the program becomes corrupted, or if a change to the engineering logic goes wrong and the original program must be brought back.

Figure 4. RTU560 web server to upload and download the RTU backup.
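For text-based configurations, such as a switch or router running configuration exported to a file, a saved backup can be compared with the current configuration to see what has changed before deciding whether to restore. The following is a minimal sketch using Python's standard `difflib`; the function name `config_changes` and the sample configuration lines are hypothetical, and binary RTU/PLC program files would need the vendor's own tools instead.

```python
import difflib


def config_changes(backup_text: str, current_text: str) -> list:
    """Return only the added (+) and removed (-) lines between a saved
    configuration backup and the current configuration."""
    diff = list(
        difflib.unified_diff(
            backup_text.splitlines(),
            current_text.splitlines(),
            fromfile="backup",
            tofile="current",
            lineterm="",
        )
    )
    # The first two lines are the '---' / '+++' file headers; skip them.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    backup = "hostname sw1\nvlan 10\nvlan 20"
    current = "hostname sw1\nvlan 10\nvlan 30"
    for line in config_changes(backup, current):
        print(line)
```

An empty result means the current configuration still matches the backup.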

How to Achieve Redundancy at the Hardware Level

Multiple strategies can be integrated into the control system disaster recovery plan to achieve redundancy:

A well-known technique for cutting downtime is implementing redundancy at the hardware level by installing two SCADA servers that work in primary (online) and standby modes. If the online SCADA server fails, the standby SCADA server takes over the role of the online server, and the plant operators in the control room (C/R) are still able to monitor and control the plant. Generally, multiple operator workstations are also installed in the C/R so that the real-time graphical plant data remains available to the operators.
A redundant array of independent disks (RAID) is a data storage method that combines two or more hard disks into one logical unit to achieve redundancy and prevent data loss. In RAID level 1, called mirroring, at least two hard disks each hold an exact copy (mirror) of the data.
Keep enough SCADA/DCS system spares, such as spare standalone computers, new hard disks, power supply units, communication modules, PLC/RTU I/O cards, CPU units, etc.
NIC bonding or teaming provides redundancy in network connectivity. If one NIC fails, the second NIC becomes active.
Redundant LAN networks use two network switches to improve network availability, reliability, and scalability.
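The primary/standby takeover described in the first strategy can be illustrated with a simple heartbeat rule: a server counts as alive if it has reported within a timeout, and the first alive server in preference order is treated as online. This is a simplified sketch under assumed names and an assumed 5-second timeout; real SCADA/DCS products implement failover internally and with far more safeguards.

```python
from typing import Optional


def active_server(
    last_heartbeat: dict,
    now: float,
    timeout: float = 5.0,  # assumed value for illustration
    preference: tuple = ("primary", "standby"),
) -> Optional[str]:
    """Return the first server in preference order whose last heartbeat
    is no older than `timeout` seconds, or None if none is alive."""
    for name in preference:
        seen = last_heartbeat.get(name)
        if seen is not None and now - seen <= timeout:
            return name
    return None


if __name__ == "__main__":
    # The primary stopped reporting at t=90 s; the standby is still alive.
    print(active_server({"primary": 90.0, "standby": 100.0}, now=102.0))
```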

Disaster Recovery Solutions

Industrial control systems control critical plant infrastructure, and plants running without a disaster recovery plan are at high risk and may face significant downtime. The strategies discussed above, from regular backups to hardware redundancy, can be employed when formulating a SCADA disaster recovery plan.