Recently, one of our two major competitors experienced a rare but fairly major system outage that took it “off the air” for an extended period. This might seem like a good thing for us–and indeed we do pick up some additional traffic (and thus revenue) when this happens.
But such events raise challenging short-term and longer-term issues that aren’t such good things.
First, we have to be able to react to a sudden surge in traffic. In the financial services business, major customers assume any of us is down if we fail to respond to just a few transactions, and some of these customers can quickly reroute traffic to other sources.
When we are the recipient of this rerouting, we can see rapid increases in transaction traffic–surges of 10 percent or more within a few seconds–and have to rebalance resources to avoid slowing everyone else down. We have a process for doing this and we practice it, but major outages are rare for any of us, and we don’t actually have to execute the process very often, so it’s always a little scary when we do.
Second, we get a lot of comments from our smaller business customers and sometimes from consumers questioning our ability to provide continuous service. We monitor the electronic conversation points where such comments generally show up, and it’s amazing how vocal some people get–even when they have little or no idea what actually happened or what we do to prevent (hopefully) or recover from (if we have to) such events. In this sense, none of us benefits from the customer-impacting problems that any of us may have.
Third, our larger customers rev up their audits when this happens. Customer audits are a way of life in financial services, and we welcome them as a way to build a strong relationship and better understand our customers’ concerns. But we have some challenges when the conversation turns to, “How can you prove to me that this will never happen here?” We have disaster recovery (DR) plans–and we practice those, too–but the decision to declare a disaster and move to our recovery site is a big one.
You don’t want the switch to the disaster recovery site to take longer than local recovery would take, but you can’t know in advance how long that will actually be. Whatever you do will be disruptive to some degree and, depending on the actual cause of the problem, there will probably be some service interruption.
We have minor issues all the time, but our high-availability architecture masks them from our customers, who rarely see a service disruption. That’s what high availability is for, after all. And if we do go to the DR site, we also have to figure out how to get back later–not a trivial operation and one we can’t easily practice.
So I sympathize with our competitor that had to manage the recent outage, make the decision to switch to a backup site and then figure out how to get back. Customers have processes that depend on the continuous availability of their software-as-a-service (SaaS) partners, and when the service goes away, those processes sometimes grind to a halt.
The “edge” of a modern information business isn’t where it used to be. All too often it’s right at your customers’ interaction with their customers. They are going to want you to be there with them–even if they don’t want to pay for what it takes to make that possible. That’s the business continuity challenge we all face: “five nines” or better availability at a “three nines” or lower price.
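For the record, the arithmetic behind those phrases is stark. As a back-of-the-envelope sketch (the availability levels shown are generic, not figures for our service or anyone else’s), here is what the annual downtime budgets work out to:

```python
# Rough downtime budgets behind "three nines" vs. "five nines" availability.
# Plain arithmetic only; nothing here is specific to any particular system.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # roughly 525,960 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, availability in [("three nines (99.9%)", 0.999),
                            ("four nines (99.99%)", 0.9999),
                            ("five nines (99.999%)", 0.99999)]:
    print(f"{label}: about {downtime_budget_minutes(availability):.1f} minutes of downtime per year")
```

Three nines allows nearly nine hours of downtime a year; five nines allows barely five minutes. A single extended outage of the kind described above can consume a year’s five-nines budget many times over.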