Survey: A Question for Controls Engineers

I'd like the control system experts on the board to consider a hypothetical situation, and describe their next few steps in analyzing and fixing the problem.

<b>The Situation:</b>

You've been experiencing periodic failures of equipment that is important in the reliable and successful completion of your process/product. You've traced the failures down to 3 or 4 components that seem to be failing on the equipment on a pretty regular basis.

Historical analysis of the equipment doesn't show any significant excursions outside of normal operating parameters. Trending analysis on the input and output points isn't showing any spikes or other anomalies that precede the failures. You've adjusted your control schemes to ensure all parameters are within their normal tolerances, but you can't run the system productively at those levels. You've tested every piece of copper wire and fiber that connects any system remotely involved in the operation of this equipment. You've replaced power supplies, frequency converters, anything that could be interfering with your signals. You've involved the equipment vendor in these discussions as well, and they are at a loss too; they can't replicate your component failures in their lab under your process conditions.
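(For concreteness, the kind of trend scan implied above might look something like the rough sketch below, run against a historian CSV export. The file name, tag layout, rolling window, and threshold are all invented for illustration; it simply flags anything that strays more than a few sigma from a rolling baseline.)

<pre>
# Hypothetical trend scan over a historian CSV export.
# Assumed layout: a "timestamp" column plus one column per analog tag.
import pandas as pd

df = (pd.read_csv("historian_export.csv", parse_dates=["timestamp"])
        .set_index("timestamp")
        .sort_index())

WINDOW = "15min"   # rolling baseline window (assumption)
Z_LIMIT = 4.0      # flag anything beyond 4 sigma of the rolling mean (assumption)

for tag in df.columns:
    rolling = df[tag].rolling(WINDOW)
    z = (df[tag] - rolling.mean()) / rolling.std()
    spikes = df.index[z.abs() > Z_LIMIT]
    if len(spikes):
        print(f"{tag}: {len(spikes)} excursions, first at {spikes[0]}")
    else:
        print(f"{tag}: nothing outside {Z_LIMIT} sigma")
</pre>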

Management is breathing down your neck: these component failures, which can't be attributed to anything you can detect, observe, or theorize on your own, are costing your company millions in lost product and busted equipment each month this goes on.

Besides quitting your job and becoming the actor who plays Neo, what's your next set of steps? What else could be going on? How do you resolve this issue? Who else do you involve? There are no wrong answers, only a description of how you would analyze this insane problem using every bit of engineering skill and knowledge you've gleaned from years of breaking and fixing things.

<i>I intend to use the responses in this thread to promote discussion in this and other online forums, so don't post if you don't want to be a part of that. If you don't want to be attributed, please post anonymously, but please also state your job title.</i>

Thanks Everyone!!
 
S
It would be helpful to know what types of components are failing (PLC CPUs, I/O modules, VFDs, contactors, relays, valves, etc.), and what the failure mode(s) are.

Also, you may have to accept that you may NEVER find out what the problem is and concentrate on trying to prevent it/treat the symptoms.

Without knowing what types of components are failing or in what manner, it's hard to make suggestions for remediation, but you might try installing components rated higher in amps, temperature, etc. Possibly adding line filters, isolation transformers, etc.

But we really need more information before we can even make any intelligent suggestions.
 
W

What changed? Nothing.

>You've adjusted your control schemes to ensure all parameters are within their normal tolerances, but you can't run the system productively at those levels.

You've spelled out the issue: "can't run the system productively at those levels".

The engineers assigned to troubleshooting this process made their observations and measurements and adjusted the control scheme to within 'normal' tolerances, but operations is under a pointed gun (the almighty KPI) to produce at X% more than the level provided by 'normal' tolerances. So operations cranks up the process as soon as the engineering guys head back to their offices, and the resulting stresses cause the failures.

Ecclesiastes 1:9: there is no new thing under the sun.

So, what do you do?

Scale down production's KPI so they can run within design limits. Or redesign.

Job description: vendor who sees this all the time.
 
D

David Mertens

We had something very similar to this on a tank terminal. The communication with the redundant Profibus on the jetty would fail a couple of times every week. First we added software to detect and diagnose all hardware components; this revealed that the problem was actually more frequent than we had thought, but sometimes only one of the redundant channels failed, so no visible failure occurred. We had already checked all the wiring, specifically the grounding and shielding, but this did not solve the problem. Then we installed cameras covering 360° around the cabinet with the failing components. From the footage we noticed that every time there was a failure, a barge came very close to the jetty and turned around to stop at the quay opposite our jetty. We concluded that the radar used by the barge, in combination with the maneuver that brought the barge really close to the jetty, caused our problem. We replaced the watertight polyester cabinets with a type that has a metal mesh embedded in the polyester, and that solved the problem.
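(A rough sketch of the kind of timestamp correlation that makes a coincidence like this jump out, assuming the failure log and an external event log, camera, AIS, or gate records, have been exported to CSV. The file names, column names, and the 10-minute window are assumptions, not anything from David's actual system.)

<pre>
# Hypothetical correlation of failure timestamps against an external event log.
# File names and column names are assumptions for illustration.
import pandas as pd

failures = pd.read_csv("profibus_failures.csv", parse_dates=["time"]).sort_values("time")
events   = pd.read_csv("external_events.csv",  parse_dates=["time"]).sort_values("time")

# For each failure, find the nearest external event within 10 minutes.
matched = pd.merge_asof(
    failures, events, on="time",
    direction="nearest", tolerance=pd.Timedelta("10min"),
)

hits = matched.dropna(subset=["event_type"])   # "event_type" is an assumed column
print(f"{len(hits)} of {len(failures)} failures fell within 10 min of a logged event")
print(hits[["time", "event_type"]].to_string(index=False))
</pre>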

Kind regards
David
 
You are describing a very real problem in the corporate workplace, but as with all such cases the answers lie in the specifics; often the person formulating the problem does not have sufficient understanding of it to properly identify the real issues... hypothetically, of course.
 
J

James Ingraham

1) Some time ago, there was a post on Control.com (http://www.control.com/thread/1026232237) about drives spontaneously catching on fire, with no logical reason for them to do so. It turned out that the drives were being SET on fire in an act of sabotage. By the time I got to the point described in the original question, I would certainly put several cameras on the system. In fact, even if you DON'T suspect sabotage, this isn't a bad idea. It could conceivably catch some "blink" in the state of the system.

2) I'm almost positive that the original question came about from the Stuxnet issue. I could be wrong; I just read a long article describing Stuxnet, so perhaps it's just at the front of my mind. Still, this is PRECISELY what Stuxnet did to people. Prior to Stuxnet, this never would have occurred to me. Now I would seriously look at it. Of course, there's not a lot you can do. Stuxnet was a complete unknown for months, and even after security companies found it there was a time you were just stuck with it.

3) I actually have a similar problem at a customer site at this very moment. We've swapped out virtually every component, added line filters, played with parameters, and had at least half a dozen different people look at the problem, plus discussions involving more people here at headquarters. The system is quite old, so what we are recommending to the customer is replacing the control system. This is, of course, non-trivial. And this is a small machine cell; I can only imagine the pain if it were something orders of magnitude larger. Still, one of the most commonly used forms of troubleshooting is to swap components and see if that fixes it. At some point, you have to give up on direct replacements and try a radical departure. Changing the "brains" (e.g. swapping out an S7-300 for a ControlLogix) is a big deal, but at some point you'll have to try. In the case of Stuxnet this would work, although you wouldn't have any idea WHY it worked. In the case of sabotage it wouldn't, but that would at least tell you SOMETHING.

4) Although the original question states that recording the data has led nowhere, I doubt that's REALLY true. Consider Cliff Stoll's "The Cuckoo's Egg." It started with a discrepancy between two different time-accounting programs. It turned out there was a hacker in the system, and he had covered his tracks well but not completely. The problem for Stoll was how to watch the hacker without letting him know he was being watched. Easy enough: he hooked onto the analog phone line that the hacker used, thus duplicating all data back and forth to a terminal he could watch. Whether the problem is malicious or accidental in nature, an independent monitoring system is a good idea. There are tools for directly watching traffic on DeviceNet, CAN, and Profibus. You can use port-mirroring and Wireshark on Ethernet. You can stick a camera in front of a drive's display or the LED indicators on an I/O card. This data is fundamentally different from a trend in the PLC or HMI. It's de-coupled from the things that could be causing the issue. I don't care how good the Stuxnet writers were, they can't infect a camera that's not connected to anything but a monitor.
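(As one concrete rendering of that idea: a standalone laptop on a mirrored switch port can log traffic completely independently of the PLC and HMI. The sketch below uses scapy; the interface name and log file are assumptions, and it is a minimal illustration rather than a finished monitoring tool.)

<pre>
# Minimal passive traffic logger for a mirrored (SPAN) port on a standalone laptop.
# Interface name and log file are assumptions; requires scapy and root privileges.
from collections import Counter
from datetime import datetime
from scapy.all import sniff, IP

counts = Counter()
logfile = open("mirror_log.txt", "a")

def log_packet(pkt):
    """Timestamp every frame to a local file and count traffic per IP pair."""
    if IP in pkt:
        counts[(pkt[IP].src, pkt[IP].dst)] += 1
    logfile.write(f"{datetime.now().isoformat()}  {pkt.summary()}\n")
    logfile.flush()

try:
    # store=False keeps memory flat on a long capture; stop with Ctrl-C.
    sniff(iface="eth1", prn=log_packet, store=False)
finally:
    # Dump the top talkers once the capture is stopped.
    for pair, n in counts.most_common(10):
        print(pair, n)
</pre>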

-James Ingraham
Sage Automation, Inc.
 
Your question assumes that solving a complex automation software problem is somehow different from solving any other problem in an industrial or manufacturing environment. It's not.

In fact, your question already implicitly assumes that the problem has a "control system" cause (and several responders are already going down that line, while others manage to avoid the trap). It also implicitly assumes that the reader is already following some sort of methodology (good or bad) by involving suppliers, testing components, researching historical data, etc.

There are several established and proven methodologies for solving exactly these types of problems - Kepner-Tregoe is just one example - but all follow the same basic process (a toy worksheet of the first two steps follows the list below)...

Identify what the problem is.
Identify what the problem isn't.
Identify what you know and don't know.
Identify and eliminate possible causes.
Identify what you can do to workaround the problem until a complete solution is found.
<b>Assume nothing. Test everything.</b>
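As a toy illustration of the first two steps, an is/is-not worksheet can be as simple as a table you force yourself to fill in; every entry below is invented purely to show the shape of the exercise.

<pre>
# Toy is/is-not worksheet in the Kepner-Tregoe style.
# Every entry is invented for illustration; fill in your own facts.
SPEC = {
    "What":   {"is": "output stage fails on drive 3",        "is_not": "other drives on the same bus"},
    "Where":  {"is": "cabinet at the far end of the line",   "is_not": "identical cabinet indoors"},
    "When":   {"is": "roughly weekly, during day shift",     "is_not": "at night or during commissioning"},
    "Extent": {"is": "3-4 components, always the same ones", "is_not": "random components across the plant"},
}

def print_worksheet(spec):
    """Print the IS / IS NOT columns side by side so the distinctions stand out."""
    print(f"{'':8}{'IS':42}{'IS NOT'}")
    for dimension, cols in spec.items():
        print(f"{dimension:8}{cols['is']:42}{cols['is_not']}")

print_worksheet(SPEC)
</pre>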

The issues of "management pressure" and "millions of dollars" are completely irrelevant to the process of actually finding a cause, other than making it easier for management to justify spending additional money on solving the problem.

What's most important is gathering enough knowledge (whether that means reading manuals, asking suppliers, teammates, and workmates, or asking questions on control.com, etc.) to perform whatever structured problem-solving process you are following.

There have already been some good suggestions of possible causes by others in this thread. These are just things to feed into the problem-solving process, i.e., identify and eliminate possible causes.

Importantly, it's also already been pointed out that by relying on one source of data (e.g. a trend display) you are assuming this data source can truly eliminate a cause, which, crucially, in the case of something like Stuxnet, it cannot.
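To make that last point concrete: if you can get a second, genuinely independent measurement of the same variable (a standalone recorder, a clamp-on logger), a trivial comparison against the PLC/historian trend will show whether the trend can be trusted at all. The sketch below is an illustration only; the file names, tag, resample rate, and tolerance are all assumptions.

<pre>
# Hypothetical cross-check of a historian trend against an independent logger.
# File names, column names, and the tolerance are assumptions for illustration.
import pandas as pd

plc = pd.read_csv("historian_speed.csv",   parse_dates=["time"]).set_index("time")["speed_rpm"]
ind = pd.read_csv("standalone_logger.csv", parse_dates=["time"]).set_index("time")["speed_rpm"]

# Resample both onto a common 1-minute grid before comparing.
plc_1m = plc.resample("1min").mean().interpolate()
ind_1m = ind.resample("1min").mean().interpolate()

TOLERANCE = 25.0  # rpm; arbitrary
diff = (plc_1m - ind_1m).abs()
disagree = diff[diff > TOLERANCE]

if disagree.empty:
    print("Historian and independent logger agree within tolerance.")
else:
    print(f"{len(disagree)} samples disagree by more than {TOLERANCE} rpm:")
    print(disagree.head(20))
</pre>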

Rob
www[.]lymac.co.nz
 
M

Michael Toecker

Everyone,

Thank you for your responses so far; everyone has presented good approaches, and there are some questions I need to clarify. Please continue posting about how you would handle the situation; I'm still very interested in your troubleshooting process.

First, the intent of this post was to pick the brains of control system pros and see what they would do in response to a pretty nasty situation. I wasn't trying to solve a particular problem, which is what most posters are attempting. I wasn't after any specific diagnosis, but rather at the process by which you would develop one.

I remember the first time a friend of mine worked with fiber optic cable: he had no idea that you can't make extremely sharp bends in it and expect it to work. He was attempting to run it through existing conduit, using existing pull boxes, and was pulling his hair out trying to find out why the system wasn't working. The answer was that he was using an unfamiliar technology without first finding out how it differed from the older technology. And in trying to figure out the problem, he didn't have the information to include 'light doesn't behave like electrons' in his troubleshooting.

James Ingraham hit this situation right on the nose with his post about Iranian engineers. I was reading the Wired article on Stuxnet (http://www.wired.com/threatlevel/2011/07/how-digital-detectives-deciphered-stuxnet/) and had a moment of professional empathy for the Iranian engineers. Here they were, wrestling with this nightmare problem, one that any one of us could have, and their troubleshooting universe didn't even include cyber security.

Now, we need to. So my next question is this: Do we as a profession have the knowledge, tools, and experience to even begin to look for a cyber security problem in our control system?

Mike Toecker
 
A colleague of mine called me out for not being straightforward in this post, and for that I'm sorry.

It's not my intention to deceive, but to inform by example, an example based on the Stuxnet infection in Iran.

So, fellow hard-hats, I'm sorry for not just coming out and saying "Stuxnet". I do hope the example helped raise awareness that cyber security is now a real failure mode in our systems, but I'll be straightforward in the future.

Mike
 