Pack-rat Approach to Data Storage is Drowning IT

By John Parkinson  |  Posted 08-05-2005
Opinion: CIOInsight columnist John Parkinson warns that information overload is outstripping our ability to develop new storage technologies. What's a poor CIO to do?

"The Information Revolution is over—information won." —Thomas Pearson, Quality Progress

One of my favorite technology trend lines derives from work originally done by researchers at IBM in the 1980s. They looked at the fundamental data storage efficiency equation—how many physical atoms does it take to reliably store a single data bit? Back in the early days of computing, a punch card stored 80 bytes of data encoded in EBCDIC (remember that?). So a card stored 640 bits using a very large number of atoms—about 10 to the 20th power atoms per bit.

By the 1970s, we had progressed to magnetic tapes and the early hard-disk storage technologies, and we were doing much better: magnetically stored bits took only about 10 million atoms each. By calculating the storage ratio for a range of devices, the IBM research group predicted that the atoms-per-bit ratio would follow an approximately exponential improvement trend (which looks like a straight line if you plot the ratio on a logarithmic scale) for about the next 50 years.

Today, we are tracking pretty well to their prediction—and are down to around 10,000 atoms per bit with SDLT tape and the latest high-density disk drives. The trend line reaches 1:1 around 2018 or 2020, by which time we will need nano-manufacturing processes and exotic materials to make this level of storage density work reliably, especially for long-term storage and archival retrieval media, which must preserve stored data for decades or centuries.
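The trend arithmetic above is easy to sketch. Taking the column's figures at face value, with roughly 10^20 atoms per bit for a circa-1950 punch card and roughly 10^4 today (the exact anchor years are my assumption for illustration), a straight-line fit in log space lands close to the 2018-2020 crossover the column predicts:

```python
import math

# Rough data points read off the column (year, atoms per bit).
# The anchor years are assumptions for illustration, not the column's.
points = [(1950, 1e20), (1975, 1e7), (2005, 1e4)]

# Fit a straight line through the first and last points in log10 space
# (the column's "approximately exponential" trend).
(y0, a0), (y1, a1) = points[0], points[-1]
slope = (math.log10(a1) - math.log10(a0)) / (y1 - y0)  # orders of magnitude per year

# Extrapolate to where the ratio reaches 10^0 = one atom per bit.
crossover = y0 + (0 - math.log10(a0)) / slope
print(round(crossover))  # about 2019, within the column's 2018-2020 window
```

The intermediate 1975 point does not sit exactly on this line, which is why the column hedges with "approximately exponential"; the endpoints alone are enough to recover the predicted crossover.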

What's equally interesting is to contrast the radical improvements in storage density with the equally radical growth in storage demand. As more and more business and consumer devices gain digital capability, and more and more previously analog content becomes digital, whether at origination or through conversion, storage demand is exploding. In business, government and personal life, first-world economies and their organizations and consumers are creating and storing new digital content in ever-increasing amounts, and keeping it for longer and longer periods. There are even efforts underway to systematically organize the increasingly digital record of our society's history: Did you know that, since 1950, over 16,000 different (and generally incompatible) storage formats have been defined and used for digital content? We now have enough calibrated growth data to predict, for at least a couple of decades ahead, the lower bound of the storage volume that will be needed.

Now for the bad news. Well before we get to 1:1 atoms per bit, we will need more atoms than are likely to be available if we want to store everything we will be creating and keeping. In fact, even if every atom of the earth were available to store a single bit, we still wouldn't have enough atoms to meet the projected storage demand by 2018. That's scary.
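The ceiling invoked here can be put in rough numbers. Assuming Earth's mass of about 5.97 x 10^24 kg and a mean atomic mass of about 30 u (standard textbook figures, not from the column), one bit per atom tops out around 10^50 bits:

```python
# Back-of-envelope for the one-bit-per-atom ceiling the column invokes.
EARTH_MASS_KG = 5.97e24                # textbook value
MEAN_ATOMIC_MASS_KG = 30 * 1.66e-27   # ~30 u, a rough average over Earth's composition

atoms_in_earth = EARTH_MASS_KG / MEAN_ATOMIC_MASS_KG
print(f"{atoms_in_earth:.1e}")  # about 1.2e50 atoms, i.e. ~1.2e50 bits at one bit per atom
```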

So let's look at what's going on here, and at what we might be able to do about it.

First off, we can try to reduce the unplanned or unconscious duplication of stored content. As storage costs have declined, we have generally gotten sloppy about efficient storage schemes—sometimes deliberately. We keep backup copies (at least some of us do) of much of our information because we know that the technology isn't perfect and that we sometimes make mistakes that wipe out essential data. Often we keep several backups. We're all going to have to get more disciplined about this—because, soon, we just won't have enough storage space available.

Second, we tend to keep complete copies of all intermediate stages of work in process—for example, versions of a final document so that we can go back and see how the end state was achieved. Keeping only what's changed would cut down on the storage requirements for this and many other kinds of audit trails considerably, although we (and the technologies we use) will need new habits and capabilities to work this way easily and effectively.
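The keep-only-what-changed idea can be sketched with Python's standard difflib; the three-revision document below is invented for illustration:

```python
import difflib

# Three invented revisions of a document, each a small edit on the last.
base = "\n".join(f"line {i}: unchanged text" for i in range(200)) + "\n"
rev2 = base.replace("line 100: unchanged text", "line 100: edited text")
rev3 = rev2.replace("line 150: unchanged text", "line 150: edited again")
revisions = [base, rev2, rev3]

# Full-copy storage: every intermediate stage kept in full.
full_bytes = sum(len(r) for r in revisions)

# Delta storage: the first revision in full, then only unified diffs.
deltas = [revisions[0]]
for prev, curr in zip(revisions, revisions[1:]):
    diff = "".join(difflib.unified_diff(prev.splitlines(True), curr.splitlines(True)))
    deltas.append(diff)
delta_bytes = sum(len(d) for d in deltas)

print(full_bytes, delta_bytes)  # the deltas are far smaller when edits are small
```

Real systems take this further with binary deltas (version-control packfiles work this way), but the trade-off is the one the column names: reconstructing an old version now requires replaying a chain of diffs rather than reading one file.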

Third, we often send complete copies of information to a lot of people, rather than just storing the information once and sending only a link to where it is stored. This is another habit we are going to have to break, and it will be helped by the steady increase in persistent network connectivity. After all, the multiple-copies habit is really a legacy of an earlier age of connectivity deployment when we or others we worked with were often offline and needed to have frequently refreshed local copies of everything to enable us to keep on working.
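The store-once-and-send-a-link idea is essentially content-addressed storage. A minimal sketch, assuming a SHA-256 digest of the content serves as the "link" (the class and names here are invented for illustration):

```python
import hashlib

class ContentStore:
    """Store each unique blob once; hand out its hash as a shareable link."""

    def __init__(self):
        self._blobs = {}  # hash digest -> content

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self._blobs[key] = content  # identical content lands on the same key
        return key                  # the "link" you send instead of a copy

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = ContentStore()
report = b"Q3 results: up and to the right. " * 100

# Ten recipients "receive" the same attachment; only one copy is stored.
links = [store.put(report) for _ in range(10)]
print(len(set(links)), len(store._blobs))  # 1 1
```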

Fourth, we can devise better ways of finding what we have stored. A large part of the total storage demand is associated with "meta information" rather than the information itself. We consume these additional bits for a variety of reasons, including reliable provenance, rights management and security, but a lot of the extra bits are there just to help us find and retrieve things. A great deal of attention is being paid to this problem right now, without dramatic progress, but things are steadily getting better.

Fifth, we can give up some fidelity in our information and replace the original with a "compressed" version, something we already do when we store images as JPEG files and video material as MPEGs. If the information type has the right "domain characteristics," we can instead use lossless compression techniques (think ZIP files, or the Apple Lossless codec for music) and keep full fidelity. When we do this we trade storage volume for codec compute cycles, and compute cycles are a far less scarce resource than stored bits. The catch with any lossless codec is that, averaged over all possible inputs, as many stored items get larger as get smaller, so our systems will have to be smart enough to compress only when it pays, and to optimize on the fly.
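The compress-only-when-it-pays behavior can be sketched with Python's standard zlib, using a one-byte flag to record whether compression actually won (the `store`/`load` helpers are invented for illustration):

```python
import os
import zlib

def store(data: bytes) -> bytes:
    """Keep the smaller of raw vs. zlib-compressed, flagged in the first byte."""
    packed = zlib.compress(data, level=9)
    if len(packed) < len(data):
        return b"\x01" + packed   # flag 1: compressed version won
    return b"\x00" + data         # flag 0: stored raw; compression would grow it

def load(blob: bytes) -> bytes:
    return zlib.decompress(blob[1:]) if blob[:1] == b"\x01" else blob[1:]

text = b"the quick brown fox " * 500   # highly repetitive: compresses well
noise = os.urandom(10_000)             # incompressible: zlib only adds overhead

assert load(store(text)) == text and load(store(noise)) == noise
print(len(store(text)), len(store(noise)))  # text shrinks a lot; noise costs only the flag byte
```

The random-bytes case is exactly the "as many get larger as get smaller" problem: deflate cannot shrink incompressible data, so the adaptive flag caps the penalty at one byte per item.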

Apply all of these strategies rigorously, and we can eliminate quite a lot of duplicated content—maybe as much as 40 percent, if my cursory analysis of organizations for which I have good data is a guide. Reduce duplication too far, however, and you eventually run a risk of information loss when critical elements, or their links, are damaged by technology failures, or just by entropy. You might never actually lose a bit, but if you can't find it when you need it, it might as well have been destroyed.

For both enterprises and people, this is an area of technology strategy and practice that's going to require a lot more medium- to long-term planning and attention, and eventually much better automation. We are going to have to develop and adhere to better policies about information lifecycle management—and even then we will only be deferring the problem. The closer we get to a digital real-time record, be it the real-time enterprise, comprehensive security surveillance or the aggregation of public Webcams, the tougher the storage demands will become, and the sooner we will run out of space. In the long run, there are only two options—store less, or find a way to get beyond the one-atom-per-bit ratio.

As challenging as it is, the second option may be the easier of the two. Reality is connected together in ways that may let us store the real-time record "holographically": sort of a combination 3D ZIP file with a time dimension folded into it. And we may be able to go inside atoms and use subatomic structures as the fundamental unit of storage. I hope so, because I don't see us giving up storing everything we can any time soon, overloaded or not.

John Parkinson has been a business and technology consultant for over 20 years, advising many of the world's leading companies on the issues associated with the effective use of business automation. His next column will appear in October.
