Pressures Mount on CIOs to Deliver All the Time

John Parkinson

It’s not a good feeling when the first
voicemail you get on a Monday morning is
from your CEO; he can’t get into his e-mail.
The fix is simple, but it takes 20 minutes to
get the support resources organized to troubleshoot
the problem, do the repairs and
report back. Your CEO is relieved and so are
you, sort of.

The incident process worked. Trouble is,
you just failed your 99.999 percent availability
service-level agreement for e-mail service.
Now imagine if it had been the customer portal
that had crashed, and your biggest customer
had felt the impact.

“Five nines” availability gives you just
five minutes a year of total downtime. Even
“four nines” gives you only about an hour.
That means you have very little time to
recover when an incident is reported, and
you’d better not have too many incidents. It
also means you’d better think about all that
planned downtime you need for the various
elements of your technology, and what
to do about it, before you agree to all those
24×7 SLAs.
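
The arithmetic is easy to check. Here’s a quick
back-of-the-envelope sketch (purely illustrative, in
Python) that turns an availability percentage into a
yearly downtime budget:

    # Convert an availability SLA into its yearly downtime budget.
    MINUTES_PER_YEAR = 365.25 * 24 * 60  # roughly 525,960 minutes

    def downtime_budget_minutes(availability_pct):
        """Minutes of total downtime per year allowed by the SLA."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% availability allows "
              f"{downtime_budget_minutes(pct):.1f} minutes/year")

    # 99.9%   allows about 526 minutes (nearly nine hours)
    # 99.99%  allows about 53 minutes (roughly an hour)
    # 99.999% allows about 5 minutes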

Just patching your servers once a month
eats up a lot of that downtime budget. A
single server can take anywhere from two
minutes to more than 10 minutes to reboot
after patches are applied; that’s 24 minutes
to two hours a year. And the patch installation
itself takes several minutes in many cases.
Goodbye to even 99.99 percent uptime if you
need two hours a year for patch updates on
your core production servers.
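
A rough tally of that monthly patch window makes
the point (the reboot range comes from above; the
five-minute patch-installation time is an assumed
figure, not a measurement):

    # Planned downtime from monthly patching, per server.
    reboot_minutes = (2, 10)        # reboot time range, per the text
    patch_minutes = 5               # assumed installation time per window
    windows_per_year = 12           # monthly patching

    reboot_low = reboot_minutes[0] * windows_per_year    # 24 minutes a year
    reboot_high = reboot_minutes[1] * windows_per_year   # 120 minutes a year

    total_low = reboot_low + patch_minutes * windows_per_year     # 84 minutes
    total_high = reboot_high + patch_minutes * windows_per_year   # 180 minutes

    print(f"Reboots alone: {reboot_low} to {reboot_high} minutes a year")
    print(f"With patch installation: {total_low} to {total_high} minutes a year")

    # At the high end, reboots alone blow the ~53-minute budget of a
    # 99.99 percent SLA; add installation time and even the low end does,
    # unless patching happens without taking the service offline (rolling
    # restarts behind a load balancer, clustered failover and so on).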

There are plenty of things you can do to
address this issue, but none of them are easy
or cheap. Nor is it wise to tackle high availability
one tactical fix at a time. You waste
a lot of energy, dollars and goodwill that
way. Much better to build a comprehensive
high-availability strategy, eliminating single
points of failure, adding appropriate levels of
redundancy and planning recovery processes
that will work in the real world.

High availability is about five things:

  • Well-understood operational processes
    that everyone follows all the time, even
    when they know how to do the work (think
    commercial aircraft flight-deck checklists)
    and are tempted to take shortcuts.
  • Appropriate levels of redundancy built
    into the architecture at every layer, to avoid
    single points of failure and ensure the requisite
    levels of failover, fallback and recovery.
  • Monitoring of everything possible, at
    appropriate levels, to detect, localize, identify
    and even predict failures and other kinds of
    incidents.
  • Maximum automation, to keep people
    out of the loop as much as possible. People
    make mistakes much more often than technology
    fails, and often can’t or don’t react fast
    enough when problems occur. (A sketch of
    this idea follows the list.)
  • Well-thought-out, regularly tested recovery
    plans and procedures, because no matter
    how well you prepare, failures will sometimes
    happen in ways you didn’t expect, and
    you’ll be expected to cope.
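
To make the monitoring and automation points
concrete, here is a minimal sketch of the idea: a
watcher that polls a health endpoint and triggers an
automated recovery action only after several
consecutive failures. The URL, thresholds and
restart_service() hook are hypothetical placeholders;
in practice you would lean on your existing monitoring
and orchestration tooling rather than a hand-rolled loop.

    # A hypothetical watchdog: poll a health endpoint, and if it fails
    # several checks in a row, trigger an automated recovery action.
    import time
    import urllib.error
    import urllib.request

    HEALTH_URL = "http://localhost:8080/health"   # assumed endpoint
    CHECK_INTERVAL_SECONDS = 30
    FAILURES_BEFORE_ACTION = 3                    # do not react to a single blip

    def is_healthy(url):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def restart_service():
        # Placeholder for whatever automated recovery applies here:
        # restart a process, fail over to a standby, or page a human.
        print("Health checks failing; triggering automated recovery")

    def watch():
        consecutive_failures = 0
        while True:
            if is_healthy(HEALTH_URL):
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if consecutive_failures >= FAILURES_BEFORE_ACTION:
                    restart_service()
                    consecutive_failures = 0
            time.sleep(CHECK_INTERVAL_SECONDS)

    if __name__ == "__main__":
        watch()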

Remember, perfection has a price and
you’d better be sure the markets, or at least
your customers, are willing to pay it. It may
be cheaper to apologize every so often than to
avoid the circumstances that require you to
say you’re sorry. Just be sure that when you
must apologize, you can show you did everything
reasonable to prevent and recover from
whatever went wrong.