The hardware and software platforms on which we run our businesses are remarkably reliable, and we take that reliability for granted. Few of us consider just how this level of reliability is achieved. Historically, we haven't stressed the platforms all that hard; we've used only a fraction of the theoretical capability of the technologies we deploy. But that's about to change.
As the pressure to improve asset usage pushes technologies such as virtualization and grid computing closer to the mainstream, usage rates of our three main technology resources (compute cycles, storage and connectivity) will rise, and those rates will begin to approach the theoretical limits of capacity. Then not only will we need better instrumentation and monitoring tools, but we may also be forced to deal with some surprising reliability issues.
One of my clients inadvertently stepped into this uncharted territory. The company discovered a set of circumstances under which a combination of hardware and software that had always performed flawlessly hit a threshold and set off a cascade of failures. The failures were eventually traced to certain actions not having enough time to complete, which in turn triggered a previously unrecognized endless-loop condition.
As a result, one of the company's core business processes was sidelined for two days. This situation wasn't caused by some esoteric combination of technologies, but by a very high usage level. When the client queried its vendors, it discovered that the products had never been tested at such high usage.
The manufacturers had concluded that such testing would not be cost-effective, and had not indicated the maximum level at which their quality assurance and capability claims applied. Interestingly, when my client re-ran the workload on a different combination of technologies at the same usage level, the problem disappeared.
I've found other examples of this threshold-of-failure phenomenon in other highly used technology combinations, including another client that runs many virtualized servers in an environment where it routinely consumes more than 95 percent of available physical hardware capacity. This company is starting to see unanticipated failure events that seem to be random and difficult to attribute to a root cause, which makes diagnosis and prevention tough. You might conclude that this is simply too high a usage level (most virtualization software vendors would suggest staying at 80 percent usage maximum for stable operation), but financial pressure will drive more infrastructure managers to try to get closer to 100 percent. And these examples raise a more fundamental question for CIOs and technology strategists: Just how do vendors establish reliability claims in the first place?
The answer seems to be 50 percent physical testing (run to destruction) and 50 percent simulation and modeling based on the physical test data. This is good enough as long as you stay near the middle of the performance envelope, but what does it tell you about life at the edge of theoretical performance? "Not a lot" and "not enough" seem to be the emerging answers.
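One way to get a feel for why the edge of the envelope can behave so differently from the middle is a textbook queueing approximation. The sketch below is only an illustration under a loudly stated assumption: it treats a resource as a single-server (M/M/1) queue, which is not how any vendor mentioned here models its products and is not drawn from my clients' systems. The function name `response_time_multiplier` is hypothetical. Under that assumption, average response time grows as 1 / (1 - utilization) relative to an unloaded system.

```python
def response_time_multiplier(utilization: float) -> float:
    """Average response time relative to an idle system.

    Assumes a simple M/M/1 queue (an illustrative textbook model,
    not a description of any real product), where the mean response
    time scales as 1 / (1 - utilization).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be at least 0 and below 1")
    return 1.0 / (1.0 - utilization)


if __name__ == "__main__":
    # Compare the middle of the envelope with the edge.
    for u in (0.50, 0.80, 0.95, 0.99):
        factor = response_time_multiplier(u)
        print(f"{u:.0%} busy: about {factor:.0f}x the unloaded response time")
```

In this simplified model, the step from 80 percent to 95 percent usage takes the multiplier from roughly 5 to roughly 20, which is one plausible way that actions which always finished in time could suddenly run out of it. Real systems are far more complicated, so treat the numbers as a prompt for questions to your suppliers rather than as a prediction.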
So next time you think about pushing the limits of your infrastructure capacity to meet your budget constraints, consider asking your suppliers what they really know about their products' limits, and be prepared to adjust your plans accordingly.