Experion Server Failure

Kamran

We have a redundant pair of Honeywell servers (Experion DCS). During normal operation the primary server failed (probably due to CPU overload) and the backup did not take over, leaving the console stations blank. The sequence of events was as follows.

1) SERVER 2A (primary) failed and no data was served to the clients (console stations).

2) Checked CPU usage and found it on the high side, i.e. 60 to 80% (refer to the snapshots attached below, which show heavy services such as "spoolsv.exe", "Mcshield.exe", "HSCServer_servicehost.exe", etc.).

3) SERVER 2B (secondary) did not take over and continued running as backup.

4) SERVER 2A was stopped from the "Start/Stop Server" utility because the Experion Station window was not responding, so a manual failover could not be done.

5) SERVER 2B continued to run as backup even though it had CPU resources to spare (usage stayed at 20 to 25%) and adequate disk space (30 GB) available.

6) Restarted SERVER 2A, and after 15 to 20 minutes SERVER 2B became primary and indications were restored.

7) When SERVER 2A was logged on again, the error below was observed, pertaining to "server application dgamngr".

8) Synchronized SERVER 2A, and it kept running as backup.

9) The "Defragmentation Utility" was run three times on both servers, but some files (related to history and events) remained fragmented.

The main concerns are as follows:
1) Why did the services mentioned above consume so much of the CPU?

2) Why did the secondary not take over although both servers were synchronized?

Please share your experience here or by email so that we can avoid such failures in future.

Thanks
Kamran
 
Your best bet is to call your local TAC for help on this issue. More detail is needed (which Experion PKS release and which server model, as two examples) before any kind of troubleshooting can take place.
 
Are you saying the CPU usage was higher before the loss of redundancy or after? Remember it may be artificially high after resynching.

In my experience the majority of redundancy issues are network related. Experion has very specific requirements in terms of network speeds, adapter bindings, metrics etc.

Have you checked the Experion logs to see what the sequence of events was at the time?

Flex or console stations not connecting are usually due to a bad configuration on the station itself, or loss of view of both servers, which again points back to a network issue.
 
I communicated with the original poster in a separate email and found that he has resolved the issue.

The problem he found was:

The Experion PKS server had a very large number of pending print requests (more than 6000, it seems), while the printer had no paper and was switched off. So the "spoolsv.exe" service was eating up most of the CPU time, hence this problem.
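
For anyone who wants to catch this kind of backlog before it hurts, here is a minimal Python sketch of a spool queue check. It assumes the default Windows spool folder (C:\Windows\System32\spool\PRINTERS), where each pending job normally leaves one .SPL data file. The function names and the warning threshold of 50 jobs are my own illustrative choices, not anything from Honeywell or Experion.

```python
from pathlib import Path

# Default Windows spool folder. Assumption: the spooler has not been
# relocated on your server; adjust the path if it has.
DEFAULT_SPOOL_DIR = Path(r"C:\Windows\System32\spool\PRINTERS")


def count_spooled_jobs(spool_dir: Path = DEFAULT_SPOOL_DIR) -> int:
    """Count pending print jobs by counting .SPL data files.

    Returns 0 if the folder does not exist (for example on a non-Windows host).
    """
    if not spool_dir.is_dir():
        return 0
    return sum(
        1
        for entry in spool_dir.iterdir()
        if entry.is_file() and entry.suffix.lower() == ".spl"
    )


def queue_status(job_count: int, warn_at: int = 50) -> str:
    """Classify the queue size. warn_at is a hypothetical threshold."""
    return "WARNING" if job_count >= warn_at else "OK"


if __name__ == "__main__":
    jobs = count_spooled_jobs()
    print(f"{jobs} spooled job(s): {queue_status(jobs)}")
```

Reading the spool folder usually needs administrator rights, so the script would have to run from an elevated account or a scheduled task. Run periodically, a check like this would have flagged the queue long before it reached 6000 jobs.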

The problem seems to be resolved for now, and I am happy for him.

However, I am not clear on the following:

1. How are operators accessing the servers directly and giving print commands from them? Only administrators are supposed to touch and work with the servers.

2. Why are printers configured on the servers? As per Honeywell's best practices, printers are not configured on the servers. You may configure them on stations.

3. How come there were more than 6000 print requests and nobody bothered to look at them all these days? Generally I would expect a control engineer to check the system routinely at regular intervals and attend to any issues noticed.

Anyway, it is good that he has resolved the problem quickly and amicably.

Sastry Musti
 
Hi,

I saw this post a little late. Ask your Honeywell engineer to do a TAC System Audit on all servers. It is a checklist of around 50 points which will sort out the majority of system resource issues.

Printers should not be configured on servers.
The backup server did not become primary because it was still getting responses from the primary server. Even though CPU usage was high, the primary server could still manage to send checkpoint updates to Server B.
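
To illustrate the point, here is a toy Python model of a backup that promotes itself only when the primary goes silent. It is a generic sketch of heartbeat-based failover, not Honeywell's actual redundancy mechanism; the class name, the method names and the 5 second timeout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackupMonitor:
    """Toy backup server that takes over only when the primary goes silent."""

    timeout_s: float               # hypothetical: silence longer than this triggers failover
    last_heartbeat_s: float = 0.0  # time of the last update received from the primary

    def on_heartbeat(self, now_s: float) -> None:
        """Record that the primary sent an update (e.g. a checkpoint) at now_s."""
        self.last_heartbeat_s = now_s

    def should_take_over(self, now_s: float) -> bool:
        """True only if the primary has been silent for longer than timeout_s."""
        return (now_s - self.last_heartbeat_s) > self.timeout_s


if __name__ == "__main__":
    monitor = BackupMonitor(timeout_s=5.0)

    # An overloaded primary that still manages an update every 2 seconds:
    for t in (2.0, 4.0, 6.0, 8.0):
        monitor.on_heartbeat(t)
        print(t, monitor.should_take_over(t + 1.0))  # stays False

    # The primary stops completely after t = 8 s:
    print(14.0, monitor.should_take_over(14.0))  # True, silent for 6 s
```

In this model a primary that is too busy to serve the stations but still gets its updates out looks healthy to the backup, so no takeover happens until the primary is actually stopped.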

regards,
tomsci
 
Thanks a lot, everyone, for your responses, especially Sastry, who helped me a lot in recovering from this.

Please note that after disabling these printers the issue never appeared again.

Basically, these printers were configured on the servers to print "Event Logs", which are generated automatically whenever an "Event" is created. Since Operations was not keen to use these hard copies of the event prints, they did not keep paper in the printers, which in turn piled up spool requests.

Thanks for your valuable suggestions.

Regards
Kamran
 