Blackout of 2003

Lynn at Alist

A bit surprised no one brought up the "great blackout of 2003" - how
did all of our wonderful technology fail? Specific complaints I heard
even on CNN-type news programs:

1) time syncs at different companies varied such that it was impossible
to verify sequence-of-events across company borders (gee, no one's heard
of GPS or radio sync of time?)

2) All the auto/fail-safe computer systems ... didn't prevent it. (I'd
wager most were in manual and/or over-ridden by supervisors whose
year-end bonus is linked to keeping the revenue-meters rolling).

3) Utilities (& politicians) in Detroit complain that when their power
started going down nearly ONE HOUR after the start in Ohio, they had yet
to receive any notice or email or hint that there was a problem heading
their way.

Anyone else have comments? views? personal experience?

- LynnL, www.digi.com

John Shaw

1) >>time sync’s varied<<

I discuss this at http://www.controlviews.com/blackoutinvestigation.html. It is true, according to multiple sources, that incorrect time stamps on logs and records (from computers) hampered the investigation. This is not new. I am frequently in control rooms for many types of process plants, including power plants and power company central system control rooms. It has always disturbed me that the clocks were almost always incorrect, sometimes by a couple of minutes.

While I haven’t investigated anything as big as this blackout, I have had to try to find out the cause of a problem. In some cases I might observe something happen in the plant, note the time from my wristwatch, go to the control room and check the log or trend on the DCS. On other occasions I would have to compare the logs from two different devices, such as a DCS and a PC connected to PLCs. If the time was incorrect, determining cause and effect was difficult.

It does not take new technology. I worked in a plant in the early 1970s in which every clock was within two seconds of the correct time (some records were kept in milliseconds). NIST has been broadcasting the correct time on WWV by shortwave for over 50 years, so it is easy to set a clock correctly.

In spite of this old technology, I have seen time displayed, in seconds, on TV that was incorrect by 20 seconds, time on outside “time and temp” clocks that was 3 minutes off.

I hate to be anal about it, but if we just set our clocks to the correct time we could compare logs from two different systems. Radio Shack now sells inexpensive travel alarms with clocks set by radio to within one second of correct time.

2) I don’t know that there were any failures of computer systems. The problem was in procedures (whether implemented in computer programs or written operator procedures) and communications. After the blackout of 1965 procedures were developed to have each region isolate itself from adjacent regions when problems occurred. However, several problems have come about. In order to isolate a region, that region must be “in balance”—that is, the generation must equal consumption. With “merchant” plants producing power in one part of the country for sale in another, keeping a regional system in balance is very difficult and will require dropping some loads if the system is a net consumer at that time.

3) Communications is a serious problem. Again, it is not a technology problem but the failure to use technology. For at least 40 years information from one location has been available in control rooms in other locations. There is no reason why system operators in one region should not have detailed information about plant and power line trips in all other regions. It should be particularly easy now (I can tell you right now that the system load in the New York State Control Area is 19477 MW and that Lake Norman in North Carolina is 2.1 feet below full pond). In 1965 the power plant control room operators in North Carolina knew about the Northeast problems as the blackout occurred.

There are procedures written to stop blackouts from spreading. These procedures do not have the force of law and are not always followed. The communications capability that has long been possible is not being used.
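On the time-stamp point in item 1, here is a minimal sketch of how logs from two systems could be lined up once each clock's error has been measured against a reference such as WWV or GPS. The function names, the event text and the clock errors (a DCS running 120 seconds fast, a PLC PC running 20 seconds slow) are all made up for illustration; nothing here comes from the blackout investigation.

```python
from datetime import datetime, timedelta

def correct(events, clock_error_seconds):
    """Return (timestamp, message) pairs shifted to reference time.

    clock_error_seconds is how far the source clock runs ahead of the
    reference (positive = fast, negative = slow). Assumed to be measured
    separately; this sketch does not measure it.
    """
    shift = timedelta(seconds=clock_error_seconds)
    return [(ts - shift, msg) for ts, msg in events]

def merge_logs(*logs):
    """Combine corrected logs into one sequence of events, oldest first."""
    return sorted((event for log in logs for event in log), key=lambda e: e[0])

# Hypothetical raw logs, as recorded by each system's own clock.
dcs_raw = [(datetime(2003, 1, 1, 12, 2, 5), "DCS: pump trip")]
pc_raw = [(datetime(2003, 1, 1, 12, 0, 10), "PLC PC: low suction pressure")]

# DCS clock is 120 s fast, PLC PC clock is 20 s slow (assumed values).
merged = merge_logs(correct(dcs_raw, 120), correct(pc_raw, -20))

for ts, msg in merged:
    print(ts.isoformat(), msg)
```

Going by the raw stamps, the low suction pressure appears to come almost two minutes before the pump trip. After correction the pump trip comes first, which is exactly the kind of cause-and-effect reversal that makes uncorrected logs so hard to use.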

John Shaw
http://www.jashaw.com/pid

Anonymous

Sorry to be on a soapbox here, but for item 3): how are you meant to send a warning email saying "my power has gone off" if the power is out? That's back to the old adage of someone who phoned a tech support line to complain his monitor had failed. Yet when the technician asked him to check the cables at the back, he said it was too dark. So the tech asked him to turn on the light and he replied, "I can't, the power's off".

People should think about what they are saying before they make daft statements to CNN etc.

Secondly, why has no one brought up the idea that it was some sort of sabotage (via a virus)? It could happen.

Then of course London had the same thing 1-2 weeks later.

Routine computer failures??? or Sabotage? or Virus?

Dobrowolski, Jacek

Technology is only as wonderful as the people taking care of it are smart.

Jacek Dobrowolski, Ms. Sc. E. E. Software Engineer

Bob Peterson

One of the alleged experts who was on TV a number of times kept blaming it on "SCADA". Several times he was asked what this was and he was never able to even get the acronym right. Once I heard him refer to it as security something.

This guy was supposedly a big wig with the federal govt at least partially responsible for infrastructure protection at one time.

My guess is that most of the stuff is left in manual to prevent nuisance trips, and the operators and supervisors on duty were too afraid to shut down a few people's power to protect the rest of us from the blackout.

Donald Pittendrigh

You forgot one very important thing,

The auto reclose system and various other legislated protection mechanisms were not working either (sabotaged????) or selected out of auto operation by someone with high school principles of electricity and a little less savvy than is required to fly a plane????

Regards Donald Pittendrigh

Michael Griffin

Another interesting fact which leaked out after the blackout was that approximately 6 months prior to this a nuclear power plant in Ohio, USA had serious problems in one of their computer systems caused by a computer worm (I think it was SQL Slammer). It knocked out the safety monitoring system (I believe this was an MMI for the safety systems). Fortunately the reactor was already shut down for other reasons so no serious consequences resulted.

The computer worm appears to have entered the plant via the business systems and then entered the control systems. Technical commentators on the situation said that commercial pressures to make use of operational data for cost reduction projects are causing companies to link their plant and business computer systems more and more closely together.

This sounds like a subject that needs to be addressed more seriously. The IT industry's approach of "patch daily and hope nothing happens" isn't a viable solution in any industry that requires high reliability.

P.S. I've just read in the news that a new hole has just been found in Windows similar to the one used by the recent MS Blaster worm. Computers that were patched to secure them from MS Blaster are still vulnerable to the new problem.

Luke

We work almost exclusively with alarms. Since the blackout we have seen considerably increased interest. The utility operators have a discussion group through NERC and the lack of talk about Aug 14 seems to be the biggest item of interest.

Say hello to Jason for me.

LukeM, www.machineautomation.org

matt hyatt

Basically it boils down to humans making mistakes re: the operation of systems which can operate on their own automatically, failing to update / communicate with each other, and not having auto-time sync between facilities (they could all be on GMT so logs have the same time / date stamp everywhere). Auto dialers were probably disabled, and to a larger degree most of these operators probably did not know what to do in this type of situation.
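To make the GMT idea concrete, here is a rough sketch of stamping log entries in UTC regardless of where the facility sits. The two fixed offsets and the function name are assumptions for illustration only; a real system would take its offset from a properly maintained time zone database and, more importantly, from a synchronized clock.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for two hypothetical facilities in summer.
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))
CENTRAL_DAYLIGHT = timezone(timedelta(hours=-5))

def to_utc_stamp(local_time, site_tz):
    """Convert a facility's local wall-clock time to an ISO 8601 UTC stamp."""
    aware = local_time.replace(tzinfo=site_tz)
    return aware.astimezone(timezone.utc).isoformat()

# The same instant, as read off the wall clock at each facility.
east = to_utc_stamp(datetime(2003, 7, 1, 16, 0, 0), EASTERN_DAYLIGHT)
central = to_utc_stamp(datetime(2003, 7, 1, 15, 0, 0), CENTRAL_DAYLIGHT)

print(east)
print(central)
```

Both entries come out with the same stamp, so logs from the two sites can be compared directly without anyone having to remember which site is an hour behind.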

Unfortunately the public gets snippets of information from non-technology persons who have been briefed by a manager or other non-technology person and, like the blackout, the whole story cascades into a bunch of nonsense about who did not do what they were supposed to at a given moment in time under this circumstance, or what happened when and why. Now the lawsuits will fly, insurance premiums will increase and in the end the power generation industry will not have improved to ensure that such a massive blackout will not occur again in the next 20 years. 90% (or more) of all accidents (or incidents) are related to human error (even if the transmission line which went down in the first place was due to lightning, human error caused the cascade of the event beyond the area it was located in).

The utility companies, like all good large corporations, are about one thing - profit. The shareholders want a return on their money, and if the companies are installing new lines, updating equipment, training operators, and spending profits on improving the transmission system (which includes working together - keeping each other duly informed, etc.), the shareholders don't see big dividends (ROI), the CEOs and others who manage these massive corporations don't get big million dollar bonuses, and they are all unhappy. Of course we the public are stuck with transmission systems which are older than I am and in dire need of upgrades, but the industry bigwigs will tell us that the lines were designed for 50 or 60 years of operation (I guess they forgot about adding capacity to the lines (more lines - better lines) as our demand for power continues to go up every year).

The point: humans are the root cause of the cascading failure which resulted in the largest blackout in US history, not the technology.

Matt Hyatt

Rufus

Hanlon's Razor:
Never attribute to malice that which can be adequately explained by stupidity.

Rufus

Michael Griffin

Several people have mentioned systems being run in manual as being a possible cause. I have heard many times (including here) that power systems are often run in manual because the automatic systems don't work properly. The deficiencies of the automatic systems don't get addressed because the plant operators are there anyway, and they usually do a fairly good job.

The big question about the blackout isn't why it started. Equipment failure producing a local blackout is a "normal" event. The real question is why did it spread? An international commission was set up to investigate, but no one has come up with a satisfactory answer. I rather suspect though that the commission was given information about all the wonderful automatic features in place, but no one told them whether these systems actually ever really worked or not.

--

************************
Michael Griffin
************************

John Shaw

I was about to say that I did not think that any computer "for which safety credit is taken" or that is connected to the safety system or used for safety would be connected to the internet or to any company wide system. At one time nothing safety related could be connected to non-safety systems.

However, nothing can surprise me now.

Computers used for safety or for other critical control functions in any plant should never be connected to company networks or to the internet. If you need to extract data from a critical computer for use on a non-critical information system, there are ways to provide one-way links. But there should never be a two-way link that can make any transfer into a critical computer or control system from outside.
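As a rough illustration of the one-way idea, the sketch below has the critical side push readings out as UDP datagrams and never read anything back. All names here (encode_reading, publish, make_sender, the tag "FT-101") are hypothetical. Note the big caveat: software like this only follows a one-way convention. It does not enforce one, and a real one-way link relies on hardware (for example a data diode) so that nothing can physically travel in the other direction.

```python
import json
import socket

def encode_reading(tag, value, timestamp_utc):
    """Serialize one reading as a small JSON datagram."""
    record = {"tag": tag, "value": value, "ts": timestamp_utc}
    return json.dumps(record).encode("utf-8")

def make_sender():
    """Create a UDP socket that this sketch only ever writes to."""
    return socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def publish(sock, destination, payload):
    """Send one datagram to (host, port); no reply is read or expected."""
    return sock.sendto(payload, destination)

# Example use on the critical side (address is a placeholder):
#   sender = make_sender()
#   publish(sender, ("192.0.2.10", 5005),
#           encode_reading("FT-101", 42.5, "2003-01-01T12:00:00+00:00"))
```

UDP is used here because the sender needs no acknowledgement from the receiver. The price is that lost datagrams are simply lost, so the information side has to tolerate gaps.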

I know that many control systems in process plants are connected to company networks and to the internet. That trend is a problem.

Which is more important: producing product or producing numbers about the product?

John

Ranjan Acharya

I think the technology failed because it is out of date and not very wonderful at all. I also think it failed because we pay ridiculously low cents per kilowatt hour versus the true economic and environmental cost of electricity - hardly an incentive for private or public cash infusions for new plant and infrastructure (or conservation). Finally, if governments are too busy trying to privatise and de-regulate electricity (it started out unregulated and it was a mess, after all) then who is looking after things? I pay a "stranded debt" charge on my electricity bill every month. It is from the old Ontario Hydro (publicly owned, in massive debt, traditional centralised power generation) that was split up into two companies - one for generation and one for distribution. In order to make the companies more attractive for privatisation (either in part or as a whole), the government held back some of the debt as "stranded debt" and shafted the end users with the payment. They also ignored every expert in the field of de-regulation who told them not to bother (the status quo was obviously no good either). We generate 1% or less of our power up here in "green" Ontario by "alternative" means. I think that Denmark is around 20% and even California is approaching 20%. Pity the poor Danes.

The technology is out of date; we don't pay the true cost; the centralised generation model is out of date &c.

Yan

MS Windows is a source of viruses. If a SCADA system is based on MS Windows, then the work culture must be of a very high standard. And the industrial Ethernet must be isolated from the outside world.

Yan

Patrick Allen

This is a subject that I have brought up several times in the past few years. Much of the production and test equipment I have been involved with is PC based. With the addition of searchable databases now becoming commonplace, computers outside the closed network are now connecting to retrieve or examine data. Program changes are being made by technicians using laptops that may have been connected to dozens of other systems including the internet.

Virus scanners are rarely installed on production PCs, as the performance hit would seriously affect any kind of high speed data acquisition. I don’t even know how software such as LabView would react to having to share CPU time with a virus checker.

In one recent case, one computer on a test line was equipped with a modem so that it could “call-out” for remote sessions with a software developer. This computer was essentially connecting to the internet “naked”, as no firewall had been installed and no updates had ever been applied to it. Even though the connection was dial-up and hence intermittent, a simple port scan would have revealed vulnerabilities inherent to all Windows machines.

Considering that thousands of man hours worth of testing data are potentially at risk, I agree that some kind of security is definitely needed.

Curt Wuollet

Thanks Michael,

I would have greatly preferred not to know there were people stupid enough to control a nuclear reactor with Windows. The NRC should execute commitment papers for those decision makers. I'd sleep better with the Chernobyl system in place.

Regards

cww

Ralph Mackiewicz

I was driving from home (north of Detroit) to Columbus, OH when the power went out. When I got to Toledo I noticed the refineries were burning a lot of excess pressure off, turned on the radio and discovered there was a blackout. After confirming that Columbus was still powered (see below), I continued on to my meeting listening to a Detroit radio station. They were desperately trying to figure out how a substation problem in Niagara Falls (what was thought to be the cause at the time) could shut down the power over such a wide area. Numerous people were calling in and offering explanations and assigning blame. One call was from a person who claimed to be an engineer from Detroit Edison and said he could explain how this could happen in such a way that anyone could understand. Here is what he said (paraphrased):

"Imagine the power grid is like a big hula hoop with 12 holes in it and there is a football team arranged around the hoop with each player at a hole. If you connect a water hose to the hula hoop and all the players stay in place, water will shoot out the 12 holes at equal pressure and keep a football in the air in the center of the hoop. As long as the football is in the air, the power will be on. If one of the players covers up the hole in front of them or drops their end of the hoop, the football will fall down causing a power outage."

I almost had to pull off to the side of the road I was laughing so hard. If this guy was operating the system, that could be the cause of the outage. Investigators should find this guy quick.

The question was posed: why didn't the protection relays and reclosers operate properly to isolate the failures? They did in some places. American Electric Power (AEP) is the utility in Columbus. They issued a press release claiming that their protective relays operated properly and isolated the AEP system protecting most of their service area. BTW, AEP has been aggressively automating their transmission substations (and is a large user of UCA/IEC61850 for their transmission substations). Technology, properly applied, does work.

On a personal note with respect to the blackout: Ever notice how there is an antenna at the top of the roof on a Ford Focus? Detroit Edison said it would take 3 days to get the power back on. So, I loaded up my rental car (a Ford Focus) with blackout supplies (dry ice, batteries, and water) in Columbus and headed home. I had to drive through a bad thunderstorm. Just as the storm ended, that antenna worked very effectively as a lightning rod: the car was struck by lightning while I was driving. All that was left of the antenna was a black spot on the roof of the car. The power came back on before I could make it home (28 hrs after it failed) resulting in a strange kind of disappointment that I just spent $150 on all that stuff I didn't need.

Regards,
Ralph Mackiewicz
SISCO, Inc.

Peter Whalley

Issues relating to IT security in general are discussed at length on the SANS web site (see http://www.sans.org/rr/ as an example), and daily security bulletins are issued by SANS. Go to http://portal.sans.org/ to subscribe. I've found these very useful in keeping up to date with security issues.

Today's bulletin indicates the IT community is in a mad panic over the latest Microsoft DCOM vulnerability. See http://isc.sans.org/diary.html?date=2003-09-11 for details. The SANS bulletin states "..Acting on this vulnerability immediately is absolutely critical.."

There was also an editorial comment posted in a SANS bulletin a few weeks ago which commented: "...[Editor's Note (Ranum): Repeat after me: Mission critical systems should be on isolated networks that are not connected to the Internet. There is no amount of web surfing fun that justifies the cost and labor downside of an incident such as the one above..."

Certainly the more enlightened in the IT Security community understand the issues involved in connection to the Internet but many have yet to learn the lesson.

WRT the power blackout, I suspect the lawyers have been rushing around telling the techs to keep their mouths shut until the lawsuits have been completed.

Bouchard, James

I think it would be wise to wait till the international commission has completed its work and submitted a report. You do not investigate this type of event in a couple of weeks.

The automatic systems do work. Quebec was not affected by this problem because its systems worked. The same systems have reduced the scope of outages and improved the restart time for major outages. The most recent example was last year when smoke from forest fires in northern Quebec tripped a major line. The automatic protection shed enough load in Montréal to keep the system up and running at about two thirds capacity. Within a few hours everything was back in operation. The reason it took so long was a problem at one substation in particular that took a longer time to get back on line.
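Stripped of all the engineering, shedding load to keep a system up is a balance calculation: whatever consumption exceeds the available generation has to be dropped. The sketch below shows only that arithmetic, under invented assumptions: the feeder names, megawatt figures and the rule that a higher priority number is shed first are all hypothetical. Real schemes act through protection relays responding to measured frequency and voltage, not a list like this.

```python
def loads_to_shed(generation_mw, loads):
    """Pick loads to drop until consumption no longer exceeds generation.

    loads is a list of (name, mw, shed_priority) tuples; a higher
    shed_priority means the load is dropped earlier (assumed convention).
    """
    deficit = sum(mw for _, mw, _ in loads) - generation_mw
    shed = []
    for name, mw, _ in sorted(loads, key=lambda load: load[2], reverse=True):
        if deficit <= 0:
            break  # already in balance, stop shedding
        shed.append(name)
        deficit -= mw
    return shed

# Hypothetical region: 1000 MW of demand against 900 MW of generation.
region = [
    ("hospital feeder", 100, 0),
    ("residential feeder", 500, 1),
    ("industrial feeder", 400, 2),
]
print(loads_to_shed(900, region))
```

With these made-up numbers the region is 100 MW short, so the first load in shedding order (the industrial feeder) is dropped and the rest stays on. Because feeders are dropped whole, the region ends up with spare generation rather than an exact match.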

James Bouchard