Are Your Systems on the Edge of Disaster?

It’s been more than 15 months now since the Great Blackout of 2003 (in which I was briefly caught, traveling to and in Cleveland), and we have had three months to ponder the findings of the commission appointed to review the causes and suggest ways to stop such problems in the future. Along with poor arboreal management (tree limbs touching 345-kilovolt power lines are clearly a bad thing), the commission singled out inadequate and uncoordinated software systems that proved unable to detect and analyze what was going on in the transmission network as a whole, even though individual parts of the system reacted just as they were supposed to. In the language of my last column, on software agents, there was no “supervisory” software (or human overseer, for that matter) that could tell the local agents to change their behavior.

This got me thinking about what failure-mode analysis calls “edge” cases, where you get so close to the operational limits of a system that any disturbance has to be handled immediately, and correctly, or the result is a catastrophic failure, and one that’s generally really hard to recover from.

Lots of examples pop up in real life. Highway traffic flows smoothly when the total load is well below the speed-specific carrying capacity of the road. Up the load a little, and (a) the average speed drops, and (b) minor accidents start to happen. Up the load above the road’s carrying capacity “limit,” and (a) traffic rapidly comes to a halt, and (b) accidents become major.

You can build a mathematical model for such events (I spent the first few years of my career in technology building models like this) and run a simulation to find out what can happen. In just about every case, the event propagation speed (how fast a change in the system affects other parts of the system) eventually exceeds the reaction time (how fast the system can adjust the operation of its parts to compensate for the change), and the system “crashes” in some way. The situation gets worse as the system becomes more complex—the integrated avionics and flight controls suite of an Airbus A320, say, or the scheduling and tracking systems for the U.S. rail network—and the number of possible adjustment/effect combinations climbs geometrically. Proving that the model matches reality perfectly is very difficult—and occasionally makes being a test pilot very interesting.
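A toy version of such a model fits in a few lines. The sketch below is an illustration only, not one of the models described above. It assumes a set of identical nodes, a single shared capacity limit, and a hypothetical rule that a failed node's load is split evenly among the survivors. It leaves out reaction time altogether, so it shows only what happens when nothing intervenes.

```python
def cascade(loads, capacity, first_failure):
    """Return the node indexes that fail, in order, after one initial failure.

    Assumed rule (hypothetical): when nodes fail, their load is shared
    equally among the surviving nodes, and any survivor pushed above
    `capacity` fails in the next round.
    """
    alive = {i: float(load) for i, load in enumerate(loads)}
    failed = []
    to_fail = [first_failure]
    while to_fail and alive:
        # Remove this round's failures and collect the load they carried.
        shed = 0.0
        for i in to_fail:
            shed += alive.pop(i)
            failed.append(i)
        if not alive:
            break  # nothing left to absorb the load
        # Spread the shed load evenly over whatever is still running.
        share = shed / len(alive)
        for i in alive:
            alive[i] += share
        # Any survivor now over the limit fails in the next round.
        to_fail = [i for i, load in alive.items() if load > capacity]
    return failed


# Five nodes at half load: the loss of one is absorbed by the rest.
with_headroom = cascade([50] * 5, capacity=100, first_failure=0)

# Five nodes at 90 percent load: the same single loss takes down everything.
near_the_edge = cascade([90] * 5, capacity=100, first_failure=0)
```

The two runs differ only in how close the nodes start to their limit, which is the point about edge cases: the disturbance is the same, the outcome is not.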

Make the system complex enough and interconnected enough, and you start to get “emergent” behaviors—the realm of complexity theory. Emergent behaviors tend to occur when there are significant differences between “local” and “total system” responses to changes in system state. Most “engineered” systems today probably aren’t complex enough to exhibit emergent behavior, but we do have three examples that probably are—the Internet, the Public Switched Telephone Network and the cellular telephone system infrastructure, which are themselves somewhat interconnected. As the business systems we build become more and more interconnected with each other over these already interconnected platforms, we ought to be asking: “Is this safe?” “What might go wrong?” “What can I do about it?”

There’s another kind of potential problem being built into our business automation that also has some scary aspects to it. More and more, the “real” data in our systems is being “wrapped” in layers of abstraction and being centralized. I have more than 2,500 entries in the contact directory on my cell phone/PDA, so I don’t bother to remember anyone’s number any more. If I want to call them I just speak their name and the phone gets the number from its memory and places the call. But what if I lose the phone? It’s not a disaster, because I have a backup, and all those numbers are available via other sources. But what if the numbers aren’t stored in the handset but instead are held in some central database the handset calls to get the latest number—and the database goes AWOL? Think that’s unlikely? A form of this happened in the financial services industry during the Sept. 11 attacks, when a lot of direct wires into New York City went down. Even though it was possible to route the calls over the PSTN (which is designed to be very reliable through redundancy and rerouting ability), no one knew what number to call.

And RFID works this way almost entirely—tag data requires access to an Object Name Service server before its contents can be resolved and acted on.

There are good reasons for centralizing data this way. It’s easier to keep the data up to date without having replication and latency overheads in the system, and it’s generally easier to secure centralized data as well. Cost and convenience are powerful design drivers, but they bring a requirement for “constant” (or instant) connectivity as well. Now we are back in the realm of infrastructure complexity that we already know has some potential pitfalls.
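One common way to soften that dependency is to keep the last known answer locally and fall back to it when the central store cannot be reached. The sketch below assumes a hypothetical `fetch` function standing in for the call to the central database; the names `CachedDirectory` and `DirectoryUnavailable` are invented for illustration.

```python
class DirectoryUnavailable(Exception):
    """Raised by the (hypothetical) central lookup when it cannot be reached."""


class CachedDirectory:
    """Look up numbers centrally, but remember the last good answer locally."""

    def __init__(self, fetch):
        self._fetch = fetch   # assumed callable: name -> number, may raise
        self._cache = {}      # last known good number per name

    def lookup(self, name):
        """Return (number, is_stale). is_stale is True when served from cache."""
        try:
            number = self._fetch(name)
        except DirectoryUnavailable:
            if name in self._cache:
                return self._cache[name], True
            raise  # never seen this name, so there is nothing to fall back on
        self._cache[name] = number
        return number, False
```

The stale flag matters: a cached number may be out of date, and the caller should know it is working from memory rather than from the central record.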

The pressure to optimize the use of our capital assets isn’t going to go away—so what should we be doing to ensure that little problems don’t trigger big disasters?

The first thing is to look harder at possible failure modes, and at where the edges actually are. If you don’t know how close to the edge you are, you can’t back things off a little, or intervene to move the edge further away through added capacity or improved responsiveness. A few years ago, one of the major airline reservation systems ran out of locator codes, the six-character strings that are the keys into the central reservation databases they use. Without a code, you can’t take a new reservation or issue a new ticket. In retrospect, it was easy to see how this happened, but no one had thought to build in the checks to see when the (large but finite) code space was close to fully utilized. Add in a seasonal travel spike, a price war and…oops. Panic for a couple of hours while they figured out how to free up some “previously owned” codes for immediate reuse.
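The missing check is not complicated. As a rough sketch, assume the six characters are drawn from uppercase letters and digits (the actual alphabet is not stated here, and real systems typically exclude some characters), and assume warning thresholds of 80 and 95 percent, both chosen arbitrarily for illustration:

```python
import string

# Assumption: codes use A-Z and 0-9. The real alphabet may be smaller.
ALPHABET = string.ascii_uppercase + string.digits
CODE_LENGTH = 6
CODE_SPACE = len(ALPHABET) ** CODE_LENGTH  # 36**6 possible codes


def check_code_space(codes_in_use, code_space=CODE_SPACE,
                     warn_at=0.80, critical_at=0.95):
    """Classify how full the code space is.

    The thresholds are illustrative; a real system would set them from
    how fast codes are consumed and how long it takes to free more.
    """
    used = codes_in_use / code_space
    if used >= critical_at:
        return "critical"
    if used >= warn_at:
        return "warning"
    return "ok"
```

Under these assumptions the space holds 2,176,782,336 codes. The number is large, but a check like this, run on a schedule, turns "we ran out" into "we will run out in a few weeks."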

The second thing is to plan for some appropriate responses. This is one of the areas where “capacity on demand” would come in handy if (a) it really were available on demand, and (b) everyone’s additional demand didn’t autocorrelate (another aspect of complex interconnected systems) so that the extra capacity arrangements are themselves rapidly overwhelmed.

Third, look harder at how to manage “graceful” degradation modes for the systems being affected. “Brownouts” and traffic slowdowns may be annoying, but allowing the system to reset itself at a less stressed state can be a workable solution—and buy time to adjust capacity that isn’t available immediately but will be soon.
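In its simplest form, a brownout means everyone gets proportionally less rather than some getting nothing. A minimal sketch, assuming demand and capacity are measured in the same units and that every consumer can tolerate a scaled-down allocation (the function name `brownout` is made up for this example):

```python
def brownout(demands, capacity):
    """Scale all allocations by one factor when total demand exceeds capacity.

    `demands` maps a consumer name to its requested amount. If the total
    fits within capacity, every request is granted unchanged.
    """
    total = sum(demands.values())
    if total <= capacity:
        return dict(demands)
    factor = capacity / total  # same reduction for every consumer
    return {name: amount * factor for name, amount in demands.items()}
```

With demands of 60 and 40 against a capacity of 50, each consumer receives half of what it asked for, and the system keeps running at the reduced level.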

Fourth, build in circuit breakers that limit how bad a disaster can be. It may be better to lose a small part of the system altogether than have the damage propagate widely and unpredictably, as it did in the 2003 blackout. Ideally, the system management tools would analyze the relative cost of degraded performance for everyone versus no performance for a few, and act accordingly. Also, ideally, your Service-Level Agreements and supply contracts will have such situations covered ahead of time.
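In software, the same idea is usually implemented as a wrapper around calls to a dependency: after a run of failures it stops passing calls through for a while, so a struggling component is cut off rather than dragged down along with its callers. The sketch below is one minimal version. The class name, the failure threshold and the cool-down period are all assumptions chosen for illustration, and it does not attempt the cost analysis described above.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to stay open
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down is over: allow a trial call, but trip again
            # immediately if that one call fails.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the count
        return result
```

The design choice is deliberate: callers of an open breaker fail fast and locally, which is the "lose a small part" side of the trade-off.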

And fifth, stay worried. Even with all these improved business continuity and disaster recovery plans in place, the unthinkable will still happen—and no one plans for the unthinkable. That’s the danger of emergent behavior in complex systems. They’re potential disasters waiting to happen—and one day they will.

John Parkinson is chief technologist for the Americas at Capgemini. His next column will appear in February.
