I/O Communications Loss and Then Loss of Ready to Start

Thread Starter

lightson

When failure occurs, which initially was within ~30 minutes of rebooting "R" core, "R" core's votes are thrown out and missed vote counters will accumulate rapidly in the diagnostics missed votes section for R core.

All cores have been inspected (card status) and found at A7 prior to rebooting; then all cores are rebooted and reach A7 status again. During the reboot of each core, no outstanding flags are noted except I did see "R" register I/O board failure. This appeared to clear upon completion of reboot.

All power supply values are stable.

Immediately upon total reboot the unit has "ready to start" status which would last for about 30 minutes. During this good status there were no notable alarms coming in. P62 (common I/O comm loss) and P249 begin to toggle alarm-normal-alarm, etc. sometime during the 30 minute good status. Note: master reset will not reestablish ready to start; this has only been achieved by rebooting the "R" core.

The TCQA and TCQC boards and their ribbon cables have been changed, the EEPROMs were swapped onto the new cards, and MK5MAKEs were run; the failure time stayed at ~30 minutes.

Next, the power supply board for R (located in the back of core) was replaced. The "ready to start" status then lasted for about 1 hour.

Lastly, the LCC board was changed and ready to start status lasted for eight hours before failure.

Any help is appreciated. Thank you.
 
lightson,

Have you used the LCC display/keypad to troubleshoot which card in <R> is dropping out of A7? All cards associated with a processor have to go to A6, and then they will all go to A7. (Remember: NOT all cards associated with a processor are in that processor; <R>, <S> & <T> have cards in <P> and <QDn>; <C> has a card in <CD>.) It's possible for a particular card to drop out of A7 while the processor stays at A7: once all cards have gone to A6 and then A7, if a card fails or has a problem the processor will generally stay at A7 even if one or two cards drop out of A7 (and that includes going to A8 or A9). I think you need to find out which card or cards are dropping out of A7, and using the LCC display/keypad's "I/O STATES" function is the only way I know of to do this.
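
If it helps to keep track of what you see on the "I/O STATES" display between reboots, here is a rough Python sketch for a laptop (it does not run on or talk to the Mark V). The card names and states in the example are made-up sample readings typed in by hand, and the function name is hypothetical; the only rule it encodes is the one above, that any card not showing A7 is worth a closer look.

```python
def cards_not_at_a7(io_states):
    """Return (card, state) pairs for every card not reporting A7, sorted by card name."""
    return sorted(
        (card, state) for card, state in io_states.items() if state != "A7"
    )


# Sample readings only (assumed values, not from a real panel).
r_readings = {"TCQA": "A7", "TCQC": "A7", "TCEA": "A6", "TCDA": "A7"}

for card, state in cards_not_at_a7(r_readings):
    print(f"{card} is at {state}, not A7")
```

Logging a set of readings right after a reboot and again after the failure should show which card changed state first.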

"Common I/O Communication Failure" usually refers to a problem with <C>'s I/O not updating on the DENET (<C> being "common" to <R>, <S> and <T>). All four processors (<C>, <R>, <S> & <T>) have to be at A7 to get a Ready to Start, and because a lot of Starting Means-related I/O is connected to <C>, a failure of <C>'s I/O to get updated can cause a problem. The DENET connects <C>, <R>, <S> & <T> to each other and provides the way for the processors to share information and synchronize to each other.

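Just to spell out the "all four at A7" condition, here is a small Python sketch. It is a simplification and an assumption on my part: the real Ready to Start logic (L3RS) has many more permissives than processor state, and the function name and sample states are hypothetical.

```python
PROCESSORS = ("C", "R", "S", "T")


def processors_blocking_ready(states):
    """Return the processors that are not at A7 (a missing entry counts as not at A7)."""
    return [p for p in PROCESSORS if states.get(p) != "A7"]


# Sample states only: <R> has dropped to A6 while the others are healthy.
blocking = processors_blocking_ready({"C": "A7", "R": "A6", "S": "A7", "T": "A7"})
print("Ready to Start possible" if not blocking else f"Blocked by: {blocking}")
```

An empty list only means the processor-state condition is satisfied; it says nothing about the other permissives.
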
I don't recall but I think the DENET connects to the TCCA, or the CTBA (sorry; it's been a long time and I don't have access to my Mark V manuals at this writing).

Has conductive grease been used on the various ribbon and cable connections in the Mark V recently? It could be that some kind of corrosion has built up on the pins/connectors and plugging them in-and-out several times, along with refreshing the conductive grease (too much is as bad as none!) could help with the problem.

Have you used Dynamic Rung Display to see what is causing L3RS to drop out? (Usually this can't be reset with a MASTER RESET, unless it's some latched alarm/logic. If it's caused by communication failures, a MASTER RESET isn't going to fix that problem.)

That's about all I can think of at this time. Please write back to let us know what you find as you work to resolve this issue.

MK5MAKEs aren't going to solve the problem. If <S> and <T> aren't having problems with the information on the EEPROM, then it shouldn't be causing a problem with <R>. A LOT of unintended problems can occur with MK5MAKEing and downloading, so that's not a best practice. If you think the EEPROM information in <R> is suspect, it's best to just FORMAT <R>'s EEPROM using the EEPROM Downloader's FORMAT command and perform a hard re-boot of <R> using the switch in the <PD> core; then use the EEPROM Downloader to download USER, and--even if the processor goes to A7 during or shortly after the download--perform another hard re-boot of <R> (using the switch in the <PD> core).

Lastly, if you're killing the 125 VDC to the Mark V panel to re-boot all the processors/cores, that's not a best practice either. You can power them all down, one at a time (<C>, <R>, <S>, <T>, <X>, <Y> & <Z>), and then, beginning with <T>, boot them up one at a time: <T>, followed by <Z>; once <T> has been at A7 for a couple of minutes, <S>, followed by <Y>; once <S> has been at A7 for a couple of minutes, <R>, followed by <X>; and once <R> has been at A7 for a couple of minutes, <C>. Once <C> goes to A7, you should have your 'Ready to Start' indication. But killing 125 VDC to the panel and then applying 125 VDC to the panel without systematically powering up the panel is not good for the power supplies or the processors. (I know; people see that happen in Mark V training at the GE plant, but there is no I/O connected to those panels, and it's a training environment--not the real world.)
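
For anyone who wants the power-up order as a checklist, here is a short Python sketch that just prints the steps in the order described above. Nothing here touches the panel. The function name is hypothetical, and the 2-minute settle time is my stand-in for "a couple of minutes", so adjust it as you see fit.

```python
# Control processor paired with the protective processor booted right after it.
BOOT_PAIRS = [("T", "Z"), ("S", "Y"), ("R", "X")]


def power_up_steps(settle_minutes=2):
    """Build the ordered power-up checklist; settle_minutes is an assumed value."""
    steps = []
    for control, protective in BOOT_PAIRS:
        steps.append(f"Power up <{control}>")
        steps.append(f"Power up <{protective}>")
        steps.append(
            f"Wait until <{control}> has been at A7 for {settle_minutes} minutes"
        )
    steps.append("Power up <C>")
    steps.append("Wait for <C> to reach A7, then check for Ready to Start")
    return steps


for number, step in enumerate(power_up_steps(), start=1):
    print(f"{number}. {step}")
```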
 