Analysis: Managing the Unthinkable
For most, disaster recovery and business continuity is a theoretical exercise. For many, however, it became frighteningly real a year ago. Deputy Editor Terry A. Kirkpatrick interviewed two IT executives who worked at or near Ground Zero about their work lives during the past year. Both had extensive disaster recovery plans in place before Sept. 11, but nothing prepared them for that day, or for the weeks and months that followed. Now, mostly recovered and thinking about future risks, both men (one the CIO of a Wall Street securities brokerage, the other the CIO of a public housing authority) say they're discovering the limits that the economy and the nature of their operations are placing on what they can do, realistically, to make their organizations less vulnerable to disaster.
Jonathan E. Beyman is the CIO at Lehman Brothers, whose IT offices occupied three floors of the World Trade Center's North Tower. Cary B. Peskin is CIO of New York City's Department of Housing Preservation and Development, whose offices remain four blocks east of the World Trade Center site. Though their organizations are quite different, the CIOs' experiences and thinking today are remarkably similar. Here, in their own words, are some of the lessons from Sept. 11.
I'll call you if I can find you
E-mail to Lehman Brothers CIO Jonathan Beyman from Robert Schwartz, managing director and CTO; Sept. 11, 9:28 a.m.
Where are You???
E-mail from Jonathan Beyman to Richard Greenbaum, Lehman's managing director, global systems and technology; Sept. 11, 10:55 a.m.
Jonathan Beyman: I was trying to figure out who was alive. I'm in a meeting in London, and somebody bursts in, and we all huddle around TVs. "That's my building," I'm thinking. My office was in One World Trade Center, the 40th floor. I can't get anybody on the phone. Everybody there was heading toward the stairwells. I couldn't get anybody on mobile phones, either. Blackberries were working, so we started getting e-mails. I called one of our guys in Jersey City (N.J.), a guy who works for me. Our facility there looked right across the Hudson River at the World Trade Center. While I'm on the phone with him, he starts screaming; he had just watched the other plane fly into the second tower. And I was watching it on TV. It was weird and horrible.
Cary Peskin: I had called a meeting at 8:30 in my office, and we were sitting at the conference table when the building shook. My back was toward the window, and somebody gestured and said, "Look, there's smoke coming out of the North Tower!" And I turned around, and I saw a huge gaping hole in the tower. We didn't know what happened. So I called my wife, and she turned on the news and told me that a plane hit the tower. I was standing facing the towers, I'm watching this in real time, and she was watching it on TV, and at that moment the second plane flew into the south tower, and we both shrieked. The ball of fire came right toward this building.
Thus began a year of recovery: new systems, new procedures and, most importantly, a whole new way of thinking. The definition of "disaster" had changed forever.
Beyman: We always thought of disaster recovery as a short-term outage, a fire in a data center, let's say. We were very good around systems recovery. We had backup sites for mainframes. We were okay on that stuff. We had our main data center in the World Financial Center [adjacent to the towers]. It was water-cooled, and when the tower fell, the air-conditioning water pipes were severed. So the data center got very hot. Some of the internals on the machines were 180 degrees. The machines report out to EMC, and EMC told us what was happening. They said the viscosity of the oil in the fans started to give way. The machines were throwing off all kinds of errors. So within 12 hours or so of the towers falling, the data center had to be shut down, and we switched to our backup center in Jersey City. All the equipment in Manhattan was lost.
Peskin: My building was completely engulfed in debris from the collapse of the towers. They wouldn't let us out of the building. We shut down the ventilation system, and that prevented a lot of the airborne particles from getting in. We had two diesel backup generators. One failed. The second was oil-fired, and we didn't have an unlimited fuel supply. As we started losing services, I realized the environment was becoming very unstable, and I felt it would be best to bring down the infrastructure in a controlled manner. I made the decision to shut down everything except for the Internet and e-mail. We wanted to maintain whatever outside communication we could. But we eventually lost the Internet and e-mail, because the Internet connection we share with other agencies was destroyed, and we had no means of reaching the outside world.
Also knocked out by water and debris were 300,000 voice lines and 3.5 million data circuits in lower Manhattan. A simple phone call became an impossible luxury.
Beyman: We never really thought about telephone connectivity to our customers as something that would be disrupted in the long term, and, quite honestly, that was our biggest single problem in the recovery. The way trading floors work is, I'm sitting at a trading desk, I'm a government trader, and you're a big client, PIMCO out in San Francisco, let's say. You don't dial me on the phone. You press a button on your desk that lights up a button on my desk. It's called an ARD, an automatic ring-down. I pick up the phone and talk to you, and you and I talk every day. When we lost the Financial Center, we didn't have that line anymore. So the guy from PIMCO had to dial a number in Jersey City that he probably didn't know. So that was problem No. 1. Problem No. 2 was there were so many calls to Jersey City (we were there, Merrill was there, Goldman had guys there, Deutsche Bank had guys there) that their central office was overwhelmed. Then we also didn't have enough phone capacity for all of the traders moving to Jersey City; it wasn't set up as a trading center. We didn't spend the money to make sure we had the same ability to receive inbound phone calls in Jersey City. Again, the concept was if we ever have to fail over to there, switch to a standby backup system, it's just for a day or an hour. It's not for months.
Recovery required a lot more than just fixing technology. Beyman lost an employee in the collapse of the tower, one of the more than 600 IT employees working there for Lehman.
Beyman: We all went to lots of funerals. Wall Street is a very insular place. There were a lot of ex-Lehman employees at Cantor Fitzgerald, which lost so many people. I knew a lot of them. It was a very difficult time. The firm brought in counselors, and we tried to do everything possible to help people through it. I had to do different things as a leader: you have to over-communicate, talk about issues constantly, hold lots of conference calls. When you do this, you have to be very open and honest.
Peskin: We've had some very tragic losses here at the agency. We didn't lose employees, but we did lose a number of relatives of employees: children, parents. One employee lost two children. Lots of cases of trauma, anxiety. People walk into the building, they have anxiety attacks. The agency has taken quite an aggressive role in bringing in grief and trauma counselors. It has brought in similar professionals to provide support to senior staff members, because although I have my 80, 90, 100 people to worry about, who's going to worry about the CIO?
Beyman: People were scared. One day, weeks after the attack, a computer in Jersey City got smoky. People got jumpy and started to flinch and think, 'Oh my God, fire.' It was just one of these spontaneous evacuations. I remember getting so angry, myself, that I said to Joe Gregory, my boss and the COO, that I'm going to use body fluids, if necessary, to put out the fire. I worked too hard. I wasn't going to let a computer fire ruin everything.
Managers also had to learn new ways to reorganize work, and do it in less-than-ideal working conditions for a prolonged period of time.
Beyman: At first, we had 1,000 people in the Jersey City backup site, and we ended up having 3,000. I mean, people were on top of each other. The rest of the people were scattered all over the place. We put banking into the Sheraton Hotel in midtown Manhattan. They took the entire hotel. We had people out in Long Island, in Hauppauge. We had people in Princeton, N.J. Everybody who would give us space, we threw people into.
The Internet, built to withstand a war, did help to keep people connected, and in new ways. After Sept. 11, both Peskin's and Beyman's work forces were dispersed. The housing department building was turned into a relief center for two weeks. Lehman's more than 6,000 employees in lower Manhattan couldn't return to their offices.
Beyman: We wanted to restore all of our old phone numbers because for all these sales guys, that's their lifeblood. So we switched to VoIP (Voice over IP). Whenever they change offices, they take their phone numbers with them. The second thing is, because we were moving so many people, IP was easier for installs. It's actually pretty cool technology. I can pick up this phone and move it to any other jack, and when it reboots, it reboots with the same phone number because it's on the IP network. It's where the technology world is going. You know what? You have to put all this stuff back together, so why not take advantage of new things?
Peskin: We began to look at ways to allow people to conduct the same work but not necessarily from this building. So what you do is move your applications out to the Internet. If anything happens to this building, if we're Internet-based, we can conduct business from any workstation that has Internet access. So we're pushing more out to the Internet, not only in terms of disaster recovery planning, but also the city is moving toward an e-business model where the public can do a lot of what they would generally do in person with the agencies. It's a direction we were going in before Sept. 11, but Sept. 11 provided that much more motivation. We have not promoted telecommuting, but it seems we should be taking a more serious look at it.
Beyman: We never thought about what it would mean to have a totally mobile work force and what the implications of that would be on our technology. A lot of people didn't come to work for a couple of weeks, because we had no place to put them. We had previously developed this software called Tocket, a secure connection through a firewall. And we Web-enabled all of our applications so that from all these 40-something locations where people were dispersed, all anybody needed was a connection to the Internet. Once they had that, they could get at everything we had internally. We built that out, and we've maintained that, and now every application that we put out, we put out that way. To the extent there's got to be dispersion, we can handle it. The fact that we can work in a distributed fashion is a tremendous productivity improvement. Anybody who travels can get at any application in the firm from an Internet kiosk in an airport.
By December, Lehman was beginning to use a new data center in Manhattan that replaced the one it lost. Both Beyman and Peskin have significantly improved the way their systems are backed up.
Beyman: There are places where we needed manual intervention to activate the backup site, and we're trying to eliminate all of that now. So if I had a data center evacuation in London, I can now run the London data center from New York and vice versa. The new thought is, the servers need to be in different places and they need to seamlessly fail over. That's not even in a disaster scenario. The fact is, this redundancy helps your business. Automatic failover is just good business, because today any system event is a customer event. If a system goes down, the customer can't get through to you.
Peskin: We've moved toward building a very large storage area network with a disaster recovery failover off the property. We did not have that before, so we have our SAN in place now, and the next phase is going to be building a mirror SAN at another site. It's going to mirror all activities that are occurring at the primary site. That will be a hot site: every time we cut a transaction here, you can see it there almost instantaneously.
Beyman: In the world we're in today, we can see a situation in which we're displaced from this building, and so we now have hot trading floors in New Jersey that people can get out to and start doing their business again. On Sept. 11, we had space there, we had a network, we had connectivity, we had computers. And we recovered the first day in terms of funding ourselves, and recovered in the debt business in two days. But when we had done our disaster recovery thinking, we never anticipated trading out of New Jersey for the long term. As it turned out, we had people in Jersey City for six months, actually trading. So we ended up, in very short order, building out facilities and making them more permanent, even knowing that we were moving back to Manhattan. But the idea was, "Gee, you know what, we're going to leave those in place, and if we ever have to move out of our new building in Manhattan, we're going to always have something else available, just because we've now been burned once."
Peskin and Beyman also have a new appreciation for something so obvious that nobody had thought about it much before: If systems are to work, people have to get to them.
Peskin: We cannot assume anymore that we can always get people on the property here, and that's why we're looking at a mirror site. We had tape backups, but if you can't get people on the property to your systems, restoring them is obviously just not doable. We also bought UPS units, battery backups. If there is a problem, we have enough time to bring the system down before any damage results. They can call out to a pager, and our administrators can go into the system remotely. That was on our drawing board. What Sept. 11 did, if there was anything positive that came out of it, was that it accelerated that plan.
Beyman: On Sept. 11, we didn't know where anybody was. You start contacting your teams, find out who's where, who's going to Jersey City, can they get there? You have to be somewhat nimble in these things because you never know. It's funny, I had always gone to Jersey City, but I always took the ferry or the PATH train from Manhattan. I didn't know how to get there from my home in Connecticut. So I had to call my secretary from London, and I said, "Linda, get me directions so when I get home on Saturday I can go there."
But not every protection Beyman and Peskin can imagine is feasible. Both have increased their spending on disaster recovery, but there are limits.
Beyman: We have about 7,000 people in Manhattan. We can say, "Gee, let's have a hot standby for 7,000 people if we get displaced from Manhattan." But you could never afford that. At some point, there are economic decisions and probability decisions. Our spending on business continuity has gone up, absolutely, because we're focusing on a lot of things we haven't focused on before. We're spending quite a bit of money to put in a SONET Ring (a self-healing communications network) so we don't lose direct contact with customers. And before Sept. 11, we never would have allocated the space in Jersey City so that we could immediately start trading there.
Peskin: Our timetables on all these business continuity things are tough to forecast, because we're balancing an economic crisis as well. Clearly, our funding is not what we had prior to Sept. 11. The city's revenue stream has been reduced quite a bit, and the budgets of the various agencies were impacted, and technology does not have an immunity to that. Let's set technology aside for a minute. More important is that people may be homeless for a while longer because we don't have funding to construct as much housing as we would like. Technology is one thing. But you're down to basics now. People are on the street. That is a far more compelling need than a new Internet-based application. So that's where some of the very tough decisions have been and are still being made.
Another constraint: Disaster planning is limited by business demands. Though Lehman's offices in the World Financial Center, adjacent to the towers, didn't collapse, they were uninhabitable for months. Lehman bought a new building in midtown Manhattan and began using new trading floors there in February. Despite talk about the need to disperse highly concentrated work forces, Lehman decided it would not be feasible.
Beyman: This building in midtown Manhattan has 5,000 people; it's a centralization point. We didn't want to buy a building here and a building there. While there may be some disaster recovery or business continuity benefit to that, clearly it's inefficient to run these capital markets and investment banking businesses that way. We're not 100 percent done on this thinking, but the thinking is that you still can't run trading businesses remotely. You don't want traders sitting in their houses, and there's a whole bunch of reasons why: risk and also some of the speed of the technology and all that. You're still at risk, I suppose, for an event, a bomb going off outside of this building, and a loss of life associated with that. But as it relates to the loss of the facility, we've made a lot of plans. Jersey is one, but also there's remote connectivity from almost any place.
A year later, both men say one other thing has changed: the reach of their imaginations.
Beyman: The issue was always a failure of imagination. I mean, we had 6,000 people displaced. If you start with that statement, if you start by saying, "You know what? Did we ever think we would be displaced from lower Manhattan for, well, forever, and let's put a disaster recovery plan together around that?" Economically, you just couldn't have gotten it done. Now, of course, it's a different story.
Peskin: I don't think anybody ever in their wildest imagination thought that these towers would have collapsed. It has raised our level of awareness to the point that almost the first thing we consider is the risk. Disaster recovery is part of any discussion we have today. Any discussion.
Beyman: Now, you just think differently. You imagine an evacuation from New York City. You think about things longer-term, and about a world of possibilities that are infinitely more horrible than any you ever thought about before. This is, still, pretty real stuff to us.