It's not a good feeling when the first voicemail you get on a Monday morning is from your CEO; he can't get into his e-mail. The fix is simple, but it takes 20 minutes to get the support resources organized to troubleshoot the problem, do the repairs and report back. Your CEO is relieved and so are you, sort of.
The incident process worked. Trouble is, you just failed your 99.999 percent availability service-level agreement (SLA) for e-mail service. Now imagine if it had been the customer portal that had crashed, and your biggest customer had felt the impact.
"Five nines" availability gives you just over five minutes a year of total downtime. Even "four nines" gives you only about 53 minutes. That means you have very little time to recover when an incident is reported, and you'd better not have too many incidents. It also means you'd better think about all the planned downtime you need for the various elements of your technology, and what to do about it, before you agree to all those 24x7 SLAs.
Just patching your servers once a month eats up a lot of that downtime allowance. Rebooting a single server after patches are applied can take anywhere from two minutes to more than 10; at 12 reboots a year, that's 24 minutes to two hours. And in many cases the patch process itself takes several minutes more. Goodbye to even 99.99 percent uptime if you need two hours for patch updates on your core production servers.
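The arithmetic behind those figures is easy to check. The sketch below is illustrative only: the function names are made up for this example, a year is taken as 365 days, and it assumes one reboot per monthly patch cycle with nothing else counting against the SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def annual_patch_downtime_minutes(reboot_minutes: float,
                                  cycles_per_year: int = 12) -> float:
    """Yearly downtime from one reboot per patch cycle (monthly by default)."""
    return reboot_minutes * cycles_per_year


if __name__ == "__main__":
    # Compare the article's 2- and 10-minute reboots against each target.
    for target in (99.99, 99.999):
        budget = downtime_budget_minutes(target)
        for reboot in (2, 10):
            used = annual_patch_downtime_minutes(reboot)
            verdict = "within" if used <= budget else "over"
            print(f"{target}% allows {budget:.1f} min/yr; "
                  f"{reboot}-minute reboots use {used} min/yr ({verdict} budget)")
```

Under these assumptions, a monthly two-minute reboot fits inside a four-nines budget but a 10-minute reboot does not, and neither comes close to fitting five nines. The 20-minute e-mail outage alone is roughly four times the five-nines allowance for the whole year.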
There are plenty of things you can do to address this issue, but none of them are easy or cheap. Nor is it wise to tackle high availability one tactical fix at a time. You waste a lot of energy, dollars and goodwill that way. Much better to build a comprehensive high-availability strategy: eliminate single points of failure, add appropriate levels of redundancy and plan recovery processes that will work in the real world.
High availability is about five things:
- Well-understood operational processes that everyone follows every time, even when they know how to do the work and are tempted to take shortcuts (think commercial aircraft flight-deck checklists).
- Appropriate levels of redundancy built into the architecture at every level, to avoid single points of failure and ensure requisite levels of failover, fallback and recovery.
- Monitoring of everything possible, at appropriate levels, to detect, localize, identify and even predict failures and other kinds of incidents.
- Maximum automation, to keep people out of the loop as much as possible. People make mistakes much more often than technology fails and often can't or don't react fast enough when problems occur.
- Well-thought-out, regularly tested recovery plans and procedures, because no matter how well you prepare, failures will sometimes happen in ways you didn't expect, and you'll be expected to cope.
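To see why redundancy and single points of failure matter so much to the numbers, here is a rough sketch of the standard availability arithmetic. It rests on assumptions that are not from this column: component failures are treated as independent (real systems often share power, networks or software bugs), the availability figures are invented for illustration, and the function names are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Chain of components that must ALL be up: multiply their availabilities."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Identical, independent copies where ANY one being up is enough."""
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    # Three 99.9 percent components in a row, each a single point of failure.
    chain = serial_availability([0.999, 0.999, 0.999])
    # Two 99 percent servers behind a failover mechanism assumed to be perfect.
    pair = redundant_availability(0.99, 2)
    print(f"serial chain: {chain:.6f}")
    print(f"redundant pair: {pair:.6f}")
```

The point of the sketch is the direction of the effect, not the exact figures: every non-redundant component in the path pulls overall availability below that of its weakest link, while a redundant pair can exceed the availability of either member, provided the failover itself works and the copies do not fail together.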
Remember, perfection has a price and you'd better be sure the markets, or at least your customers, are willing to pay it. It may be cheaper to apologize every so often than to avoid the circumstances that require you to say you're sorry. Just be sure that when you must apologize, you can show you did everything reasonable to prevent and recover from whatever went wrong.