It’s not a good feeling when the first
voicemail you get on a Monday morning is
from your CEO; he can’t get into his e-mail.
The fix is simple, but it takes 20 minutes to
get the support resources organized to troubleshoot
the problem, do the repairs and
report back. Your CEO is relieved and so are
you, sort of.
The incident process worked. Trouble is,
you just failed your 99.999 percent availability
service-level agreement for e-mail service.
Now imagine if it had been the customer portal
that had crashed, and your biggest customer
had felt the impact.
“Five nines” availability gives you just
five minutes a year of total downtime. Even
“four nines” gives you only about an hour.
That means you have very little time to
recover when an incident is reported, and
you’d better not have too many incidents. It
also means you’d better think about all that
planned downtime you need for the various
elements of your technology, and what
to do about it, before you agree to all those
24×7 SLAs.
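For reference, the arithmetic behind those budgets is trivial to check. A quick back-of-the-envelope sketch in Python (purely illustrative) shows how fast the minutes disappear:

```python
# Back-of-the-envelope: how much downtime each "nines" level
# actually allows per year. Purely illustrative.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    budget = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: about {budget:.1f} minutes of downtime a year")
```

The output lines up with the figures above: roughly five minutes a year at five nines, about 53 at four nines and nearly nine hours at three nines.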
Just patching your servers once a month eats up a lot of that downtime budget. Rebooting a single server after patches are applied can take anywhere from two to more than 10 minutes; that’s 24 minutes to two hours a year. And the patch process itself often takes several minutes on top of that. Goodbye to even 99.99 percent uptime if you need two hours a year for patch updates on your core production servers.
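If you want to check that math, the same back-of-the-envelope treatment works here, too; the two-to-10-minute reboot range is the one cited above:

```python
# The patch-window arithmetic from above: 12 monthly reboots at
# two to 10 minutes each, before counting the patch process itself.
reboots_per_year = 12
low = reboots_per_year * 2    # 24 minutes a year
high = reboots_per_year * 10  # 120 minutes (two hours) a year
print(f"Reboots alone: {low} to {high} minutes of planned downtime a year")
print("A four-nines SLA allows only about 53 minutes; five nines, about five.")
```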
There are plenty of things you can do to
address this issue, but none of them are easy
or cheap. Nor is it wise to tackle high availability
one tactical fix at a time. You waste
a lot of energy, dollars and goodwill that
way. Much better to build a comprehensive
high-availability strategy, eliminating single
points of failure, adding appropriate levels of
redundancy and planning recovery processes
that will work in the real world.
High availability is about five things:
- Well-understood operational processes that all people follow all the time, even when they know how to do the work (think commercial aircraft flight-deck checklists) and are tempted to take shortcuts.
- Appropriate levels of redundancy built into the architecture at every level, to avoid single points of failure and ensure requisite levels of failover, fallback and recovery.
- Monitoring of everything possible, at appropriate levels, to detect, localize, identify and even predict failures and other kinds of incidents.
- Maximum automation, to keep people out of the loop as much as possible. People make mistakes much more often than technology fails, and often can’t or don’t react fast enough when problems occur. (A minimal sketch of this idea follows the list.)
- Well-thought-out, regularly tested recovery plans and procedures, because no matter how well you prepare, failures will sometimes happen, in ways you didn’t expect, and you’ll be expected to cope.
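To make the monitoring-and-automation point concrete, here is a minimal sketch of the idea in Python: poll a health endpoint and attempt an automated restart before anyone gets paged. The health URL and restart command are hypothetical placeholders, not a recommendation for any particular stack.

```python
import subprocess
import time
import urllib.request

# Hypothetical endpoint and restart command, for illustration only.
HEALTH_URL = "http://app-primary.internal/health"
RESTART_CMD = ["systemctl", "restart", "app"]

def healthy(url, timeout=5):
    """Return True if the service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

while True:
    if not healthy(HEALTH_URL):
        # The machine reacts first; a human gets paged only if this fails.
        subprocess.run(RESTART_CMD, check=False)
    time.sleep(30)
```

In practice this logic usually lives in a load balancer, an orchestrator or a monitoring platform rather than a hand-rolled loop, but the principle is the same: automation responds in seconds, and people step in only when it runs out of options.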
Remember, perfection has a price, and
you’d better be sure the markets, or at least
your customers, are willing to pay it. It may
be cheaper to apologize every so often than to
avoid the circumstances that require you to
say you’re sorry. Just be sure that when you
must apologize, you can show you did everything
reasonable to prevent and recover from
whatever went wrong.