GE gas turbine

Dear friends, in our factory we are using a GE Mark VIe controller with R, S, and T processors to run a 32 MW gas turbine.
Normally one controller is selected as the Designated Controller (DC).

But in our plant, about 1.5 to 2 hours later, two or three controllers are automatically selected as the DC at the same time, with alarms showing "Communication lost from R/S/T PROCESSOR."
When that happens, the machine trips.

Before the DC changes, some errors are visible, such as "UDH EGD fault detected" by the S, R, or T processor.

For this error we have already checked the switch and the Ethernet cables from the controllers to the I/O cards, from the controllers to the Cisco switch, and from the Cisco switch to the PC, and found them OK.
We also changed the controller to remove the EGD fault, but that did not help.

Dear friends, has anybody faced this problem, and how did you solve it?
 
SOOOOO, UDH refers to the Unit Data Highway--the link between the HMI(s) and the Mark* VIe control processors, <R>, <S> & <T>. EGD refers to the Ethernet Global Data method of transmitting data on the UDH between the HMI(s) and the Mark* VIe control processors. I personally have never given much thought to how the data is transmitted, and I never asked, either. So, I'm not a lot of help on this particular error. If I recall correctly from my past experience with Mark* VI TMR controllers, the data from the HMI(s) went to all three control processors, but any data that was sent to the HMI came from the Designated Processor (sometimes called the Designated Controller).

My question is: Was something done to the HMI recently? Some kind of ToolboxST modification which was downloaded to the Mark VIe control processors? If something was done, WHAT WAS DONE--and why???

My comment about the "troubleshooting" that apparently was done is: I hear this ALL the friggin' time, "We tried everything and nothing was found to be at fault and we still have the same problem." No details about how any of the testing was done; no details about the specific results of the testing that was done--just that nothing changed and the problem still exists. We really can't help if we don't know what was done (specifically), how it was done, and what the results were. If you used Ethernet cable testers and found the individual cables to be okay, that's good. How were the Ethernet switches tested?

USUALLY, the HMI(s) are connected to a VLAN (Virtual Local Area Network) switch which has the UDH cabling and the PDH (Plant Data Highway) cabling and possibly the ADH (Atlanta Data Highway--the name for the remote monitoring network), and a pair of cables (often, fiber optic cables) connect the VLAN switch in the area where the HMI(s) is(are) located to another VLAN switch in the compartment/room where the Mark VIe control processors are located. And from there, UDH cables are connected to each of the three control processors (sometimes even redundant cables to each of the three control processors). Were these VLAN switches tested--AND if so, how were they tested? AND, even MORE importantly--what were the results?

Personally, I HATE troubleshooting that stops when the problem SEEMS to be resolved (or even IS resolved) without identifying exactly what the root cause was. I've seen those problems come back time and time and time again. AND, it's always the same thing, "We tried everything and couldn't find a problem." And, yet, there was eventually a problem discovered and solved--but not until someone had to go to site and start doing many of the "same" things site personnel had tried with no success, BUT using different methods (logical methods) and recording how the tests were done AND WHAT THE RESULTS WERE.

And if you found nothing in the Mark* VIe control system that was at fault, did you just stop looking, or did you also check external, third-party control systems which are or might be communicating with the Mark* VIe (either through the HMI(s) or directly to the Mark* VIe controllers)?

As ControlsGuy25 wrote, you need to "capture" the Diagnostic Alarms from each of the Control Processors (using ToolboxST and making screen captures OR taking CLEAR photos) immediately after a trip event and share them with us. I have seen problems with voting happen but they are RARE and are usually the result of downloaded changes to the control processors which weren't done properly (either the changes or the downloads) OR problems with one or more of the IONET Ethernet switches which interconnect the three control processors for signal voting and sharing purposes. (These three IONET switches are usually the ones closest to the three control processors.)

The really "odd" thing about this is the timing. It would be VERY interesting to see what happens after everything is reset (and you don't really say how you're doing that, EITHER..!!!) BEFORE the turbine is restarted. Let it go for as long as possible for this test before actually re-starting the turbine again.

Another question which should probably be asked and answered is: Are there commands or data requests from a DCS or data archival/retrieval system ("historian") which are passing through the HMI(s) and the VLAN switches?

In some cases, GE actually allowed MODBUS communications to occur directly with the Mark* VIe, bypassing the HMI(s). I have seen this cause some intermittent issues with nuisance alarms, but never actually tripping the turbine--but I've not seen this option very often (and I'm glad I didn't). I'm NOT a fan of allowing non-GE controllers to directly send commands to a Mark* control system, as it usually results in problems which are ALWAYS blamed on the Mark* and which are almost NEVER the Mark*'s fault. But since, in the beginning, it's NEVER the third-party controller's fault, GE commissioning or troubleshooting personnel have to prove it's not the Mark*--and even then some people just won't believe it's not the Mark* (it is simply just SO complicated it HAS to be at fault, right? (NOT.)).

So, without a lot more detailed information and answers to the above questions and the requested data--that's about all we can say with any degree of certainty. Personally, I kind of doubt it's the UDH EGD--but it's not impossible, and it wouldn't shock me to learn it was; we just don't have enough information about the testing, how it was done, and WHAT THE RESULTS WERE. AND, we don't know exactly what the network topology at the site is (including any MODBUS communications, either through the HMI(s)--and the number of HMI(s) which have comm links to external, third-party controllers--or directly with the Mark* VIe controllers).

Help us to help you.

Or call an experienced, knowledgeable person to site. (And watch them do many of the same tests--but with different results. Why? Because there is no record of how previous testing was done and what the results were.)
 
 

Attachments: [not reproduced here]

  1. Try to ping (from a command prompt) between the controllers and the HMI and verify the communication stability (a minimal ping-logging sketch follows this list).
  2. While pinging, verify the IP addresses of the R, S, and T controllers.
  3. Try re-flashing the controllers' memory.
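
To make item 1 more systematic, here is a minimal sketch of a ping-stability logger that could run on the HMI or a laptop connected to the UDH. The IP addresses are placeholders only--substitute your site's actual <R>, <S>, <T> and HMI addresses--and this is just an illustration of logging reachability over the 1.5-2 hour window, not an official tool.

```python
# ping_monitor.py -- minimal UDH reachability logger (illustration only).
# The IP addresses below are PLACEHOLDERS; substitute your site's actual
# UDH addresses for the <R>, <S>, <T> controllers and the HMI.
import platform
import subprocess
import time
from datetime import datetime

HOSTS = {
    "R":   "192.168.101.11",   # placeholder
    "S":   "192.168.101.12",   # placeholder
    "T":   "192.168.101.13",   # placeholder
    "HMI": "192.168.101.100",  # placeholder
}
INTERVAL_S = 10  # seconds between test rounds

def ping_once(ip: str) -> bool:
    """Send one ICMP echo request; return True if a reply was received."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", "1000", ip]   # 1000 ms timeout
    else:
        cmd = ["ping", "-c", "1", "-W", "1", ip]      # 1 s timeout
    return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

if __name__ == "__main__":
    with open("udh_ping_log.csv", "a") as log:
        while True:
            stamp = datetime.now().isoformat(timespec="seconds")
            for name, ip in HOSTS.items():
                if not ping_once(ip):
                    line = f"{stamp},{name},{ip},NO REPLY"
                    print(line)
                    log.write(line + "\n")
                    log.flush()
            time.sleep(INTERVAL_S)
```

Left running from a reset until the next trip, the timestamps in the log can then be compared against the Diagnostic Alarm times to see whether the UDH actually dropped out, and for which processor.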
 
Okay, so there are some other diagnostic alarms, as shown in these screenshots.

You may have some work to do on that issue; I would review and rebuild the control network architecture/topology from A to Z.

You stated that you did all the necessary checks and downloads to the controllers, and checked the switch and the Ethernet equipment.

When did this issue start, compared to the age of the units/plant?
 
To echo ControlsGuy25, how old is the unit/Mark* VIe?

When did this "Designated Controller" issue start? How long ago? After what event?

Have you tried moving the "main" IONET switches, exchanging their positions, to see if the problem follows a particular switch or lessens in any way?

EXACTLY how are you "resetting" the problem in order to get a READY TO START indication?

Have you considered trying to wait a few hours after "resetting" before re-starting the unit to see if the problem only comes up when the unit is running?

HOW MANY HMIs are communicating with the Mark* VIe?

Are there any external, third-party communications (like MODBUS or GSM) connected to the Mark* VIe?

Is there any remote monitoring and diagnostic computer connected to the Mark* VIe--either through one or more HMI(s) or directly to the Mark* VIe?

Is there any data archival and retrieval system ("historian") connected to the HMI/Mark* VIe?
 
Ans:
1) 10 years.
2) When the unit is healthy, the problem starts to appear about 1.5 hours later; after a power reset everything becomes OK and ready to start.
3) The main switch is a managed Cisco switch, but we bypassed it with an unmanaged switch to try to identify the fault, and the result was the same.
4) Some problems come up, but they are resettable.
5) Without a power reset of the whole system (G1 & SIS1 controllers), it does not become ready to start.
6) It does not happen only while running; it also changes to DC mode while the unit is stopped.
7) Only one HMI.
8) There is a communication link between the Mark VIe and the DCS, but that switch was damaged about three years ago.
9) No.
10) Ping was tried and found OK.
11) The boot setup has already been done on the controllers.
 
I am trying to pin down very specifically when this particular communication problem FIRST started—AND what might have happened just before this particular problem started. Was it one month ago? Two weeks ago? Three months ago?

Did the problem start after a maintenance outage? Or a forced outage? Did someone attempt a download to the Mark* VIe panel just before the problem started? Was an I/O Pack or a UCSx controller replaced before this problem started?

The UDH is how the Mark* VIe and the HMI exchange data and commands and alarms. In a typical Mark* VIe configuration the HMI sends commands to all three control processors but only receives data from the designated controller when the turbine is being monitored and controlled using the CIMPLICITY (or PROFICY) application. When ToolboxST is being used to communicate with the Mark* VIe controllers it is capable of communicating with one or more controllers depending on what is being done. And, this communication between ToolboxST (on the HMI) and the Mark* VIe controllers is done over the same UDH that is being used to monitor and control the turbine and driven device and auxiliaries using the graphical user interface on the HMI (CIMPLICITY/PROFICY). In the case of the Mark* VIe the alarms and events from the Mark* VIe designated controller are sent to the HMI using the same UDH network.
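
As an illustration only (not an official GE utility): EGD production data is normally exchanged as periodic UDP datagrams--commonly on UDP port 18246, though you should verify the port and the produced-exchange periods against your site's EGD configuration in ToolboxST. A small listener like the sketch below, run on a spare machine connected to the UDH (you may need a mirrored/SPAN port on the managed switch to see unicast traffic), can show whether a controller actually stops producing on the UDH just before the designated controller switches.

```python
# egd_gap_watch.py -- sketch: watch for gaps in periodic EGD UDP traffic.
# ASSUMPTIONS: EGD production data arrives as periodic UDP datagrams on
# port 18246 (verify in ToolboxST), and this script runs on a spare machine
# on the UDH that is allowed to bind that port.
import socket
import time

EGD_PORT = 18246      # assumed EGD data port -- verify for your site
GAP_ALARM_S = 2.0     # report any producer silent for longer than this

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", EGD_PORT))
sock.settimeout(0.5)

last_seen = {}  # producer IP address -> time of last datagram

print(f"Listening for EGD datagrams on UDP port {EGD_PORT} ...")
while True:
    now = time.monotonic()
    try:
        _data, (src_ip, _src_port) = sock.recvfrom(4096)
        last_seen[src_ip] = now
    except socket.timeout:
        pass
    # Report any producer that has gone quiet for too long.
    for ip, seen in list(last_seen.items()):
        if now - seen > GAP_ALARM_S:
            print(f"{time.strftime('%H:%M:%S')}  {ip} silent for {now - seen:.1f} s")
            last_seen[ip] = now  # reset so it is not reported every loop
```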

If you could tell us when this particular problem started and how long ago it started and what was happening just before the problem started we might be able to help you work through the problem and resolve it. In my personal opinion the problem is NOT with the UDH network—it’s with something that’s going on with one or more of the controllers that’s causing this problem, even if the Diagnostic Alarm text messages are indicating otherwise. It’s very often not the most recent alarm that indicates the root of the problem and that confuses many people especially when there are SO MANY alarms (Diagnostic and Process Alarms).

There are usually three IONET switches close to the control processors. Those three IONET switches are interconnected and that’s how software voting is done—AND how problems with the designated controller initiate a switch to the next controller capable of being the designated controller. When that switch occurs, it is communicated to the HMI over the UDH. And that appears to be working properly—it’s whatever is causing the switch from one designated controller to the next that is the root cause which needs to be resolved. That’s why using a different switch for UDH communications isn’t resolving the problem—because the UDH is not (is not very likely to be) causing the switch to the next designated controller. That’s why I was inquiring about what, if anything, has been done with those three IONET switches which are doing the data exchanges and software voting.

The TMR Mark* VIe is a very synchronized system. Everything occurs nearly simultaneously in all three controllers: reading of inputs by the three controllers; exchanging the values of those inputs between the three controllers; voting the values of the inputs by each of the three controllers; executing the application code using the voted input values in all three controllers; writing to the outputs by all three controllers. If this isn’t done within a very specific time by all three controllers, and the designated controller is deemed to be incapable of acting as the designated controller, then the next capable controller becomes the designated controller. This is apparently happening for all three controllers, so in my personal opinion something is wrong with the IONET communication between the three controllers and THAT is resulting in the switching of designated controllers.
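
To illustrate the idea in miniature (this is only a conceptual sketch of median and two-out-of-three voting, NOT GE's actual implementation): each frame, every controller should end up with the same voted value--but only if it received the other two controllers' copies over the IONET in time.

```python
# tmr_vote.py -- conceptual illustration of TMR voting (NOT GE's code).
def vote_analog(r: float, s: float, t: float) -> float:
    """Median select: the middle value wins, so one bad or stale reading
    from a single controller cannot disturb the voted result."""
    return sorted([r, s, t])[1]

def vote_logic(r: bool, s: bool, t: bool) -> bool:
    """Two-out-of-three majority vote for discrete (logic) signals."""
    return (r and s) or (r and t) or (s and t)

# Example: <S> reports a wild value; the voted result still tracks <R>/<T>.
print(vote_analog(101.2, 350.0, 101.4))   # -> 101.4
print(vote_logic(True, False, True))      # -> True
```

The point is that the voting itself is robust; what causes trouble is when the IONET exchange doesn't complete within the frame, because then a controller has nothing current to vote with.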

If a download to one or more of three controllers wasn’t done correctly or the download wasn’t successfully completed but the person doing the download didn’t see the warning and/or error message(s) and rebooted the controllers this might explain what happened.

The flash memory cards used in the UCSx controllers have been known to fail over time. Replacement often solves similar problems, but as I wrote I haven’t experienced this exact problem in my career. That doesn’t mean I don’t know how to troubleshoot it and resolve it, but without good, actionable data and information it’s not going to happen with my assistance.

I strongly recommend you have an experienced and knowledgeable person come to site to observe the problem, talk with plant personnel and work to resolve the problem. I sense a high degree of frustration, which can mean impatience and doubt—which can hinder troubleshooting and resolving this type of problem either over the World Wide Web or by someone on site.

You have decided that the problem is the UDH network—yet everything you’ve done hasn’t resolved the problem. So, you need to step back, analyze what else might be causing the problem, and start logically working through each suspicion until you find and resolve the root cause.

It’s been said MANY TIMES BEFORE on Control.com: TROUBLESHOOTING IS OFTEN A PROCESS OF ELIMINATION. You have apparently eliminated the UDH network, so move on. Now you know how the designated-controller switching is determined (using the IONET)—how input states and values are voted, how the application code is executed, and how the outputs are written.

It’s still possible that the problem is some piece of the UDH; I don’t think you have really checked every component thoroughly. Usually there are two managed Ethernet switches (one near the HMI and one near the Mark* VIe). From the information provided it seems you have only checked one; maybe you only have one unmanaged switch and replaced one managed switch at a time. Usually the managed switches are connected with fiber optic cables; is that the case at your site? Have you (properly) tested the fiber optic cables? If it’s UTP Ethernet cables connecting the two managed switches, have you made sure the cables are NOT picking up electrical noise induced by being in close proximity to high-voltage and/or high-current cables? You could test this by running Ethernet cables “on the ground” instead of in the cable trays or conduits—just as a test, for a couple of days or so. If the problem isn’t resolved, you can say with certainty it’s not the cables connecting the managed switches.

Sometimes electrical noise builds up over time, and capacitance causes problems after similar periods of time. A quick test of the cables may indicate no problem, but is that a sufficient test in an industrial plant? Poor routing of cables during construction has caused many headaches years later. Or maybe someone ran new high-voltage/high-current wires in low-voltage cable trays without understanding the possibility of inducing electrical noise in other nearby cables.

I have also seen, several times, fiber optic cables that were damaged cause nuisance, intermittent problems years after construction. Sometimes copper cable sheaths get damaged during installation but don’t start exhibiting problems until years later (from moisture and/or grounds). I’ve seen cables tie-wrapped too tightly to cable trays, and when the sun beat down on them for years they were found to be chafed through and even cut by sharp edges in the cable trays (that one required a time domain reflectometer to find, and it pinpointed the location to within just a few inches!).

Does the site experience electrical storms (lightning and lightning strikes)? If so, were there strikes before the problem started?

I hope this helps you understand how important good troubleshooting, and notes of methods and results, are. There is testing, and there’s testing. Some methods can be good in one case and not accurate in others. The problem will eventually likely turn out to be something unsuspected, but because the Mark* VIe is so complicated and poorly documented it gets blamed for lots of things that aren’t its fault, even though the faults show up in the Mark* VIe. Stop. Take a deep breath. If you’re convinced that your troubleshooting of the UDH network is sufficient, then start deciding where else the problem might be and develop a plan for testing/troubleshooting. When you can’t identify the problem and there are many possibilities, troubleshooting and resolution become a process of elimination until you do find and resolve the problem.

Please write back to let us know how the troubleshooting is going—and what is eventually found to be the root cause of the problem.

Process of elimination. The trick is choosing where to concentrate one’s efforts, doing a good job of testing and recording methods and results, and then choosing the next most likely root cause and doing the same thing.
 
You provided several clear photos of ToolboxST and ControlST Alarm Displays, BUT without understanding what happened prior to each photo it's very difficult to understand what is happening.

You should also note those "Frame Skip" Diagnostic Alarms in some of the alarm display photos--that's the synchronization of control processors I referred to. A "frame" is a fixed period of time in which each control processor (controller) reads its I/O inputs, communicates its I/O input status (over the IONET) to the other control processors, then each control processor votes the I/O states, then each control processor uses the voted I/O state values in the execution of the application code (the program that controls and protects the turbine, driven device and auxiliaries), and then each processor writes to its outputs. This all starts at the same time--the "synchronization" of the three control processors--and has to be completed within the same frame. Often there is unused time in the frame--called "idle time." But this all has to happen at the same time, otherwise the processors are not going to properly control and protect the equipment--and it more than likely means the designated controller, if it's experiencing the frame skips, is deemed to be incapable of remaining the designated controller, so the designation is switched to the next "available" controller. And, if this happens to any or all three of the control processors, then it's likely that the UDH communications with the HMI are going to be suspended or shut down.
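
As a rough, numbers-made-up illustration of the frame/idle-time idea (the actual frame period and phase timings for your unit are whatever ToolboxST shows, not these values):

```python
# frame_budget.py -- illustration of frame time vs. idle time (made-up numbers).
FRAME_PERIOD_MS = 40.0   # example frame period; the real value is in ToolboxST

# Hypothetical time spent in each phase of one frame, in milliseconds:
phases = {
    "read inputs":          4.0,
    "exchange over IONET":  6.0,
    "vote inputs":          2.0,
    "execute application": 18.0,
    "write outputs":        4.0,
}

used = sum(phases.values())
idle = FRAME_PERIOD_MS - used
print(f"used {used:.1f} ms of {FRAME_PERIOD_MS:.1f} ms; idle time {idle:.1f} ms")
if idle < 0:
    print("frame overrun -> a 'Frame Skip' is logged, and a designated "
          "controller that keeps skipping frames loses the designation")
```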

Again, my analysis is based solely on the information provided by you--and that's not an awful lot of information. I still think the UDH network is fine--but that's just me. I suppose it's also possible that if the designated control processor loses UDH communications with the HMI, the designation would be switched to the next available control processor. (You could probably test that when the equipment is not running by disconnecting the UDH cable from the designated controller and observing what happens and what alarms are generated (Diagnostic AND Process Alarms), then disconnecting the UDH cable from the newly designated controller and seeing what happens, and finally doing the same to the last controller to see what happens.)

If the designated controller does switch when the UDH cable is unplugged from the current designated controller, then it's still possible the problem could indeed be with the UDH network--and in that case I would suspect the cabling between the managed switch nearest the HMI and the managed switch nearest the Mark* VIe control panel. (I am PRESUMING there are two managed switches--one near the HMI and one near the Mark* VIe control panel, and there is either fiber optic cabling OR Ethernet cabling between the two managed switches, and it would help greatly if you could please clarify how the UDH is configured for your site). (I'm also presuming the Mark* VIe panel is located some distance away from the control room where the HMI is located, necessitating the need for two managed switches and cabling to interconnect them.... It would help to know if this is correct, or not, please.)

BUT, this still does not mean the problem is with the UDH--it just says it COULD BE with the UDH (cables; switches; network cards (in the HMI); etc.). I also would like to know if there's a network time source connected to the UDH for the HMIs and the Mark* VIe (often called an NTP Server). Because if that's not working correctly that can greatly affect the operation of the UDH and the Mark* VIe.
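
If you want a quick, rough check of whether the time source on the UDH is answering and approximately how far off the local clock is, a minimal SNTP query like the sketch below can be run from the HMI or a laptop on the UDH. The server address is a placeholder for your site's actual NTP server, and the offset calculation is deliberately simplified (it ignores network asymmetry), so treat it as an indication only.

```python
# ntp_check.py -- minimal SNTP query (sketch); the server IP is a placeholder.
import socket
import struct
import time

NTP_SERVER = "192.168.101.1"       # placeholder -- your site's UDH time source
NTP_EPOCH_OFFSET = 2208988800      # seconds between 1900-01-01 and 1970-01-01

# Build a 48-byte SNTP client request (LI=0, VN=3, Mode=3 -> first byte 0x1B).
request = b"\x1b" + 47 * b"\0"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2.0)
try:
    t_send = time.time()
    sock.sendto(request, (NTP_SERVER, 123))
    data, _addr = sock.recvfrom(48)
    t_recv = time.time()
    # The server's transmit timestamp is the 32.32 fixed-point value at byte 40.
    secs, frac = struct.unpack("!II", data[40:48])
    server_time = secs - NTP_EPOCH_OFFSET + frac / 2**32
    offset = server_time - (t_send + t_recv) / 2
    print(f"server answered; approximate clock offset: {offset:+.3f} s")
except socket.timeout:
    print("no reply from NTP server -- check the time source on the UDH")
finally:
    sock.close()
```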

AND, have you considered that the UDH problem might be the network interface card(s) (NIC) in the HMI? Often, there are redundant Ethernet ports on the HMI which connect to the nearest managed Ethernet switch. If there are two UDH cables connected to the HMI you could try disconnecting one and seeing what happens and if that solves the problem then it's a problem with that NIC output, and if it doesn't solve the problem reconnect that UDH cable, wait a minute or so, and disconnect the other UDH cable from the back of the HMI and see what happens. There are many components of the UDH--and they all have to be considered, and eliminated, as the source of the problem.

Have you noted the colors and status (continuously lit; flashing) of the LEDs on the UCSB controllers when this problem occurs, and then after you reset the controllers? Can you do this while the turbine and equipment are not running? I note that you said you've replaced two UCSx controllers, which kind of says it's not the UCSx's, but it could be that one of the three is initiating these frame skips and it's the one that hasn't been replaced.... You could try using one of the previously removed UCSx controllers to replace the one that wasn't swapped out earlier and see what happens.

Again, troubleshooting is often a process of elimination--and when doing so while looking for the cause of the problem it's BEST to thoroughly prove the component or device or cabling/wiring is definitely not the problem. (Because if someone else comes and traces the problem to something that was already "tested" and "proven" to NOT be the problem, well, that's kind of embarrassing....) In this case, I would say troubleshooting and resolving the problem is going to be a process of elimination--so making sure everything is tested properly the first time is going to be very important.

I wish I could say with certainty and confidence precisely what the problem is--or what to do to troubleshoot and resolve the problem quickly and easily. But, without being on site, that's just not going to happen in this case. I, again, strongly suggest you have a knowledgeable and experienced person travel to site, one with access to OEM engineers who can help with troubleshooting and resolving the issue(s). I realize that's an expensive and time-consuming process these days (visas; pandemic-related issues; etc.)--but if you're losing production because the turbine keeps tripping and isn't reliable, it might be more economical in the long run, and faster, too.

And, please--keep us informed about the troubleshooting and resolution of this problem. It's quite interesting, and lots of people can benefit from this thread and this problem's troubleshooting and resolution!
 