Blackout of 2003

Donald Pittendrigh

Hi All

It is not good to be on a soapbox while ignoring modern electronic components such as UPSs and microwave links, etc., as it will hurt just that little bit more when you get knocked off of it.

Regards
Donald Pittendrigh
 
A study in complacency: serious problems are so few and far between that they are almost a total surprise. I have fallen victim to this many times myself. The system you know the least is the important one that just runs. The ones you know best are the unreliable ones you work on all the time. You can't take down the power net and play around or train very often. As a consequence, it's very difficult to remember clearly what to do when it barfs. I've seen quite a few systems that no one knows anything about because it's always running and must stay that way. My Linux knowledge is fading somewhat because if you aren't installing a lot of new systems, you simply don't ever do many things. I never had that problem supporting Windows :^). It's the price of success. And it's very hard, if not impossible, to justify tearing things down simply for the needed experience in bringing them back up.

So, since these guys sit at a console for years without a major incident, I can see where they wouldn't be making the split-second decisions needed to stop the avalanche. Perhaps there should be scheduled blackouts so they could conduct exercises and drills and test equipment. That would be good for users as well, especially if they forgot it was coming :^)

Regards

cww
 
Many of the auto-trip functions or auto-protection features of the various power plants did in fact function as designed, protecting the plants from being damaged while attempting to provide the levels of power required within the grid network. The resulting cascade failure, which was not managed well by the operators, resulted in power flowing in the opposite direction in the grid from its normal direction of flow (based upon where the power is normally supplied from vs. where it wound up being supplied from). As the demand for energy continued to increase, more generation facilities went off line to protect themselves from total failure.

I gleaned a fair amount of this through a lot of reading of various reports and articles published in the industry trade magazines, Newsweek, CNN, the WSJ, and the Economist, from friends in the industry, and from some general knowledge of how plants typically function and the protection systems they have built in. (These auto-protection systems can be thought of as very large circuit breakers. They did their job or function in life - don't let the plant melt down. Unfortunately, the operators did not follow good practices and manage the situation very well, because if they had, only a small portion of Ohio would have been without power.)

All of the fancy auto time-sync ideas (or any others related to improved SCADA or auto control), and additional laws to regulate response actions or the power generation industry, will not cure the basic problem: humans who don't make good decisions in crisis situations, which leads to massive system-wide failures.


Matt Hyatt
 
Michael Griffin

On September 18, 2003 16:30, Patrick Allen wrote: <clip>
> This is a subject that I have brought up several times in the past few
> years. Much of the production and test equipment I have been involved with
> is PC based. With the addition of searchable databases now becoming
> commonplace, computers outside the closed network are now connecting to
> retrieve or examine data. Program changes are being made by technicians
> using laptops that may have been connected to dozens of other systems
> including the internet.

Laptops have been a frequent cause of the SQL Slammer and MSBlaster worms returning to a network after they have been cleaned out. We recently had an e-mail sent around which asked us who had a visitor with a laptop. Our IT department detected that an unknown computer was connected to the network with the MSBlast worm active in it and they were trying to find out who it was to get them to unplug it. As long as people have laptops, building a fortress wall around a network is not a sufficient protective measure.

> Virus scanners are rarely installed on production PCs, as the performance
> hit would seriously affect any kind of high speed data acquisition. I
> don't even know how software such as LabView would react to having to share
> CPU time with virus checker.
Virus scanners are reactive programs. They only respond to known problems and they don't typically protect against worms. Virus scanners are incompatible with some application software (I was using some software recently that recommended shutting down any virus scanners when on-line to a servo controller).

Worms are perhaps a more serious threat to production PCs because they install themselves remotely via security holes and don't require anyone to click on an e-mail attachment (MS Outlook is a virus writer's best friend). Even if some PCs are immune to any existing worms, the more active worms will take up so much available bandwidth in attempting to spread themselves that they will clog up the network. If I/O, instruments, or other devices are connected to the production PCs via the same network, even PCs which are not themselves directly vulnerable to the worm may not be able to reliably access the devices they need to function.

> In one recent case, one computer on a test line was equipped with a modem
> so that it could call-out for remote sessions with a software developer.
> This computer was essentially connecting to the internet naked, as no
> firewall and no updates had ever been performed on it. Even though the
> connection was dial-up and hence intermittent, a simple port scan would
> have revealed vulnerabilities inherent to all Windows machines.
<clip>

Some people got hit by the MSBlast worm when they went on-line to download the patch to protect themselves from the worm. As soon as they got a connection - bing! - the worm was in.

Whenever a new virus or worm goes around, people like to blame system administrators who haven't installed all the latest updates and patches. However, when the SQL Slammer worm went around last winter, many of Microsoft's own computers were hit by it quite badly because they had not installed their own patches. If Microsoft can't keep their own systems up to date, is it reasonable to expect everyone else to?

Frequent patching is probably a bad idea for a production computer. In some cases, Windows patches have had bugs that were worse than any effect the virus or worm was supposed to have. This is why so many PCs with Windows don't have all the latest updates or patches installed. New updates and patches will come out faster than people can test, validate, and install the previous ones. Installing patches without testing for side effects is very risky, perhaps even a higher risk than the virus or worm presents. Patching system files may also require re-validating the test system, which may not be a trivial project in itself.
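
If you do have to patch a production machine, it helps with the validation problem to record exactly what was on the box before and after. A minimal sketch in Python (assuming a Windows box where the standard "wmic" tool is on the PATH; the output file name is just for illustration):

# Snapshot the installed hotfixes so the pre-patch baseline can be
# compared against the post-patch state during re-validation.
import subprocess, datetime

output = subprocess.run(
    ["wmic", "qfe", "get", "HotFixID,InstalledOn"],
    capture_output=True, text=True, check=True
).stdout

stamp = datetime.date.today().isoformat()
with open("hotfix-baseline-%s.txt" % stamp, "w") as report:
    report.write(output)
print("Recorded installed hotfixes to hotfix-baseline-%s.txt" % stamp)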

In short, I believe the usual office solutions of firewalls, virus scanners, and frequent OS patches are of doubtful effectiveness in their intended application, and appear to be inadequate for protecting production computers.

************************
Michael Griffin
London, Ont. Canada
************************
 
So what is wrong with Windows-based SCADA systems? I recently spent 4 years putting in XP-based and NT-based SCADA systems to keep water flowing to every house for over 100 major water districts throughout Colorado and Wyoming. All of our computers had NAV installed; some were connected to the internet, most not. We never had to respond to any type of computer problem outside of a hard drive failure or a monitor failure, and never did we have virus issues or other problems related to being tied to the internet. In fact, utilizing IBM-based machines, I can only recall one complete computer failure out of the 300+ machines installed, and it was replaced in 2 working days by IBM; we were on site in 5 hours and had their entire system functional in 2 hours, and then just waited for the new machine to show up.

Unprotected (unmaintained) systems are prone to attacks or failures. Like any system (mechanical or electrical), maintenance is required; I even perform some routine maintenance on my machine at home, and it runs 24/7/365 (has for the last several years) without a problem. I would not blame Windows; any operating system is only as good as it is maintained and protected.

Oh, if you don't think water is important, turn it off for 2 to 4 hours and see how fast people start screaming!

matt hyatt
 
Lynn at Alist

I know in this case they say it was by direct link, but don't forget the "mobile" factor these days. Even if a "Safety Network" isn't linked to any other Ethernet, where do you think the PC+Windows for programming it has been? Very likely it is one or more notebooks that move around from network to network. So these days even NO CONNECTION isn't enough since viruses can move by the new "Sneaker-net" of mobile personal computing. ;^)

- LynnL
 
Ralph Mackiewicz

> > The auto reclose system and various other legislated protection
> > mechanisms were not working either (sabotaged????) or selected out
> > of auto operation by someone
> <clip>
>
> Several people have mentioned systems being run in manual as being a
> possible cause. I have heard many times (including here) that power
> systems are often run in manual because the automatic systems don't
> work properly. The deficiencies of the automatic systems don't get
> addressed because the plant operators are there anyway, and they
> usually do a fairly good job.

I don't have specific knowledge of power plant operations but this isn't true of utility control centers that control the transmission systems.

> The big question about the blackout isn't why it started. Equipment
> failure producing a local blackout is a "normal" event. The real
> question is why did it spread? An international commission was set up
> to investigate, but no one has come up with a satisfactory answer. I
> rather suspect though that the commission was given information about
> all the wonderful automatic features in place, but no one told them
> whether these systems actually ever really worked or not.

Utility control centers are highly automated and use a variety of software applications to prevent system collapse and minimize outages. These applications calculate and analyze power flows, perform contingency analysis, and perform state estimation. Most utilities also keep detailed historical records of system data (many use the OSIsoft PI System). Even small utilities run these kinds of applications. Most of the engineers running the control centers I am familiar with (quite a few) would never consider operating the system without these applications. Failure of the Energy Management System (EMS) would not be tolerated. If the EMS was not operational, that would be a major problem, and it would have already been known if that was the case in this outage. You can't keep that secret.
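
For those who have not seen what these applications actually compute, here is a rough sketch in Python of the sort of calculation a power-flow application performs - a simplified DC power flow on a made-up 3-bus network (the numbers and topology are purely illustrative, and no real EMS works on anything this small):

# Simplified DC power flow: given net injections at each bus and line
# reactances, solve for bus angles and estimate the flow on each line.
import numpy as np

# Lines: (from_bus, to_bus, reactance in per-unit) - illustrative values only
lines = [(0, 1, 0.1), (1, 2, 0.2), (0, 2, 0.2)]
injections = np.array([1.5, -0.5, -1.0])  # generation (+) / load (-), sums to zero

n = 3
B = np.zeros((n, n))  # susceptance matrix built from the line reactances
for i, j, x in lines:
    B[i, i] += 1.0 / x
    B[j, j] += 1.0 / x
    B[i, j] -= 1.0 / x
    B[j, i] -= 1.0 / x

# Bus 0 is the slack/reference bus; solve for the remaining bus angles.
theta = np.zeros(n)
theta[1:] = np.linalg.solve(B[1:, 1:], injections[1:])

for i, j, x in lines:
    flow = (theta[i] - theta[j]) / x
    print("line %d-%d carries %+.3f pu" % (i, j, flow))  # compare against thermal limits

A real EMS runs this kind of calculation continuously over thousands of buses, layers contingency analysis ("what happens if this line trips?") on top of it, and uses state estimation to reconcile the model with the telemetry coming in from the field.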

According to reports I have read, there were highly anomalous power flows occurring immediately prior to the collapse of the system in Michigan. The problem was that these anomalies provided little or no warning of the collapse. You can't just start opening breakers when these things happen. Prudence demands that you have a pretty good idea what is going to happen before you open the breakers. It takes time for both humans and EMS to figure that out for systems as complex as electrical transmission systems. Furthermore, the systems in which these anomalies were occurring apparently did not have knowledge of the problems that were happening in separate, but connected, systems in other places. The EMS, or its operators, can't respond to external conditions when they don't know about the external conditions.

While there are system operators involved that can manually control the system, they also depend on the EMS applications. If the operators had known of the problems in the other systems, they might have been able to take steps to prevent the collapse. But this is conjecture. There may never be a way to determine what the system operators would have done if they had known more than they did.

Obviously, something didn't work as planned. The first thing they are doing is looking at all the data in the historical archives to find out what happened. The second step is to find out why it happened. It will take a significant amount of time to figure this out.

Regards,
Ralph Mackiewicz
SISCO, Inc.
 
Ranjan Acharya

I suppose we are going a bit off topic now...

Isn't part of the problem that people on the corporate side now want the plant intranets tied in for MES and ERP and so on? They end up being tied directly or indirectly into the Internet. All one big happy network (granted, with some routers and firewalls, but still easy pickings for the latest hacks).

I only see this getting worse.

On SANS they also mentioned a "responsibility" of security types to see to it that their neighbours' machines are protected. The excuses for not having patches / firewall / anti-virus that I have heard include:

- Those patches are too large to download with my 56 kbps analogue modem (IE 6 SP1 anyone?)
- I don't understand what the patches mean
- My copy is pirated / modified (e.g., Office Update asks for the CD in order to patch), so I can't or won't patch
- I'm safe because I have Windows 98 (only this time ...)
- Anti-virus is too expensive for me
- I tried ZoneAlarm but it was too complicated
- <fill in the blanks> crashed my system
- I have Norton Anti-virus, so I'm safe right?
- I never open attachments I'm not expecting, so I'm safe right?
- I only go to reputable web sites, so I'm safe right?
- I don't care
- ...

The problem is with the OEMs; we are asking users to close the barn door after the horse has bolted.

RA
 
Another resource AList readers may find interesting is the North American Electric Reliability Council website at http://www.nerc.com

They've posted a preliminary report with details of the August 14th outage timeline, as well as a yearly summary of significant grid events, and reliability assessments.

Bob
 
ScienceOfficer

Personal story, little automation content:::

Leslee and I escaped Detroit for Orlando during the mid-80s GM-10 disaster, but we return as often as possible to visit friends and family. On the afternoon of Thursday, 14 August 2003, we were northbound on I-275, approaching I-696, on our way to participate in the Woodward Dream Cruise 2003, when all the radio stations suddenly disappeared...

A couple of radio stations reappeared over the next few minutes, running on emergency generators. They had no idea what was happening, but they began speculating and never stopped...

As we continued along the freeway, we could see the growing traffic jams on the surface streets, as drivers had to deal with dead traffic lights. Detroit drivers are pretty resourceful; this actually worked a lot better than you might have guessed. Still, it was slow, and traffic backed up onto the freeways. The I-696 through lanes went from congested to stopped...

Our destination was the home of Fred and Barb Collins in Berkley, an enclave city at Twelve Mile and Woodward. We drove down the exit lane of I-696 to Greenfield, then negotiated our way into Berkley. We really weren't slowed much at all, so far...

We had spoken to our friends via my cellphone while south of Detroit. During the entire blackout, our roaming, Florida based cellphones were never useful again. The system was simply oversubscribed, and didn't allow bandwidth to roamers. Incoming calls went to voicemail; outgoing calls crashed instantly. Our friends' local cellphones continued to work, and the land line system never faltered...

Fred and Barb were drinking beer on the front porch when we arrived. They were in the process of moving to their new home, and had no battery powered radios on hand. We had one of those, and more beer in our cooler. A block party formed...

Darkness approached, and the folks on the radio continued to provide no useful information whatsoever. In fact, all of the technological information was gibberish, and the planning and scheduling information was either obvious or nonexistent. Will the automobile plants be open Friday? Not if there's no power! When will this problem be fixed? We're working on it. I never heard a hard question get a hard answer.

To put a sharp point on it, we never heard authorities give an on-point answer to a single useful question during the entire blackout. The media gave great credit to the new mayor of Detroit and the new governor of Michigan for their great leadership during the blackout; I can't figure out what they did! Do they get credit for not simply breaking down and sobbing because something is happening that is totally beyond their control?

Just after midnight Friday, power was restored to the neighborhood we were in. Still, we drove to 24 Mile and Van Dyke on Saturday morning before we found ice for our Dream Cruise picnic!
 
Uncanny! That makes _five_ people I've heard from who have had that experience. Of course, most of my acquaintances are professional SAs, until just lately.

Regards
cww
 
Amen, Michael! Most experienced systems people absolutely sweat bullets when they have to apply a patch to a working production system. Most simply won't unless they absolutely have to. And the policy of bunching them together a la Service Packs has done a lot to vindicate this approach. Better to deal with the problem you know about. This is a case where fools rush in and then the phone starts ringing.....

Regards cww
 
Another good reason to separate the control and business networks.

I had a bad experience with a brand new Dell computer recently. I hooked it up to the business network planning to download all the latest patches. I ended up going to lunch first, and by the time I got back the PC had managed to become infected with something. The IT guy was running around trying to figure out who was using up 100% of their DSL bandwidth. I wonder why their proxy server would allow a single PC to hog resources like that.

He had been able to narrow it down to a specific computer and knew its name, but had no way to tell where it was. He asked me, and I knew right away it was me. We unplugged it until he could disinfect it with some utility he had, and then I spent several hours downloading and installing various patches off the MS website. After that I installed the antivirus software they thought they did not need to buy.

One would think that Dell would have the decency to at least install the latest patches before they ship out a PC, but for some reason they choose not to do so.

My guess is that these types of attacks will continue for the immediate future. We are just going to have to be vigilant. Probably someone will come up with network manager software that will be able to look at the installation on a PC that connects to a business network and, if it is creating a problem, just isolate that PC and report it to someone to take care of the problem.
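
Something like that could start out very crudely. A toy sketch in Python (the flow log file name and its "source-address bytes" line format are invented here just to show the idea; a real product would pull the counts from the switch or proxy):

# Tally observed traffic per source address and flag any single host
# that is hogging the pipe, so an admin can isolate it and investigate.
from collections import defaultdict

THRESHOLD = 0.5  # flag any host using more than half of the observed traffic

totals = defaultdict(int)
with open("flow_log.txt") as log:          # hypothetical "src_ip bytes" lines
    for line in log:
        src_ip, nbytes = line.split()
        totals[src_ip] += int(nbytes)

grand_total = sum(totals.values()) or 1
for ip, nbytes in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    share = nbytes / grand_total
    if share > THRESHOLD:
        print("isolate candidate: %s is using %.0f%% of observed bandwidth" % (ip, share * 100))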

I am a bit perturbed with AOL of late. They obviously have to know that a huge number of these spams I am getting lately are virus/worm attacks. Why don't they just filter out those messages at the mail server level? It cannot be all that hard. I am tired of getting 15-20 spam virus/worms a day.
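
Even a crude server-side filter would catch most of the current crop, since nearly all of these worms arrive as executable attachments. A rough sketch in Python of just the check itself (the extension list is illustrative and far from a complete defense; this is not how AOL actually does it, and a real mail server would run something like this in its own filter hooks):

# Inspect one raw message file and reject it if any attachment has an
# executable file extension.
import email
import sys

BLOCKED_EXTENSIONS = (".exe", ".pif", ".scr", ".bat", ".vbs")

with open(sys.argv[1], "rb") as f:
    msg = email.message_from_bytes(f.read())

verdict = "ACCEPT"
for part in msg.walk():
    name = part.get_filename() or ""
    if name.lower().endswith(BLOCKED_EXTENSIONS):
        verdict = "REJECT: executable attachment %r" % name
        break
print(verdict)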
 
Fred Townsend

While I am not a Dell fan, I feel I must defend them. It is a difficult task to install all the hardware ECOs before shipment, much less the software ECOs. If you haven't already done so, look at the volume and frequency of "Critical Updates" from Microsoft. I have seen as many as two a day. There were probably updates released while your Dell box was in transit.

Put the blame where it belongs. I also remind you there were folks asking Bill not to release XP with such poor security. Bill scoffed at the idea of less than perfect security. It took the FBI to remind Bill Gates that XP (all Windows, for that matter) was a petri dish for viruses.

Fred Townsend
 
I've been telling customers and friends for a couple of years now to assume any Windows machine that has been on the 'net is infected. With several scans per minute, it doesn't take long. I can't see how Dell could keep current; there is often a new threat in the time it takes to process and ship. And I imagine the drives are bulk loaded and inventoried. Once they get a "trouble free" set that works with their products, they aren't going to change it without a compelling reason and pretty serious testing. If you get a virus, you don't typically blame Dell. If they introduce a serious bug when patching, it can wipe out any profit for weeks. They do have a solution, however. They will load Linux or sell an empty machine now that MS can't "cut off their air supply". That would let you be sure what is on the box before exposing it to the Internet. But people want to plug and play, and damn the torpedoes.

Actually, once 30 or 40% of the PCs on the net are running Linux, it should ease the situation somewhat. It's harder to get the exponential infection rates with even a little diversity. Till then it's obviously fairly simple to cause extensive destruction and grief; they've been doing it as long as there has been a monopoly. Once that most favorable situation for virus writers has ended, it won't be quite so overwhelming to deal with the problem.
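
A toy back-of-the-envelope simulation in Python makes the point (the host counts and probe rates are made up, and this is nothing like a real epidemiological model): a worm that can only infect one OS spreads much more slowly when part of the population runs something else, because each random probe is less likely to land on a vulnerable machine.

# Each infected host probes one random peer per round; the probe only
# succeeds if the target runs the vulnerable OS and isn't infected yet.
import random

def simulate(n_hosts=5000, rounds=12, vulnerable_fraction=1.0, seed=1):
    random.seed(seed)
    vulnerable = [random.random() < vulnerable_fraction for _ in range(n_hosts)]
    infected = [False] * n_hosts
    vulnerable[0] = infected[0] = True       # patient zero
    for _ in range(rounds):
        newly_hit = []
        for host in range(n_hosts):
            if infected[host]:
                target = random.randrange(n_hosts)
                if vulnerable[target] and not infected[target]:
                    newly_hit.append(target)
        for t in newly_hit:
            infected[t] = True
    return sum(infected)

for frac in (1.0, 0.6):
    print("%3.0f%% of hosts vulnerable -> %d infected after 12 rounds"
          % (frac * 100, simulate(vulnerable_fraction=frac)))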

Filtering is possible, but time consuming and, of necessity, behind the wave, since you have to know explicitly what you are filtering for. When you're up to yer arse in alligators...... During the mailstorms generated by the MS virus of the week, most will settle for just keeping the servers up. Sometimes that's more than they can do.

Linux has, so far, been about the best way to avoid all the hassle, expense and downtime. At least it's worked for me.

Regards

cww
 
Michael Griffin

I am not sure how Dell runs their business, but I believe that many PC manufacturers buy their hard drives with Windows already installed. The hard drives would be loaded with software from a master image during final testing of the drive. The master images are only changed at infrequent intervals, after the PC manufacturer is sure the new image is compatible with their hardware. They don't want to risk shipping PCs with untested software. What you do about service packs and patches and any problems they may cause after the PC is delivered is a matter between you and Microsoft.

If you really want your new PC to be "ready to run", you need to buy it through a local dealer (or consultant) who can do all this for you. Otherwise, with all the updating and patching that has to be done with a new PC these days, it's starting to be not much different from building your own.

--

************************
Michael Griffin
London, Ont. Canada
************************
 
Donald Pittendrigh

Hi All

We patch every PC, do a burn-in test, and print a test sheet before they leave our office. Fortunately we don't rely on PC sales to make a living, as this process used to be a matter of an hour's work: start the burn test, come back 2 days later, and print the report. Today, with all the patches and updates and other muck from muckrosoft, it takes a day of restarting and downloading to get the operating system loaded, and don't think you can download all the patches and make an install CD or do an auto install over your network anymore; the patches are being updated so fast that it is impractical.

If I got such a bad, incomplete, and essentially flawed product anywhere else, I would insist it was replaced with a new one at the shop where I bought it. If only all the software I need to use daily would run on Linux!!!

Regards
Donald P
 
I guess that the message of the blackout is that diversity in computers and operating systems is good!

Peter

Peter Clout, DPhil.
Vista Control Systems, Inc.
 
Michael Griffin

I read in the news this weekend (28th/29th of September) that Italy just had a blackout that affected more people than the one which started this discussion (57 million vs. 50 million). This is surely more grist for the mill, and it sounds as if this type of problem (cascading failures) cannot be dismissed as an isolated local incident.

--

************************
Michael Griffin
London, Ont. Canada
************************
 
You could always use Ghost to create a disk image of a system. As someone who used to run a computer lab and did things the hard way, Ghost is a huge time saver. Spend a few hours to get a network server set up to send Ghost images over a network, and you can recreate a PC in a few hours, most of which just amounts to watching a progress bar march across the screen.

As for downloading too many updates, well, install Red Hat, fire up Red Hat Network, and tell me if that's any better.

Alex Pavloff - [email protected]
ESA Technology ---- www.esatechnology.com
------- Linux-based industrial HMI ------
-------- www.esatechnology.com/5k -------
 