We've all had to deal with the explosive growth in structured and unstructured data. But every time I develop a strategy for working with the increase in data, another discontinuity in the pattern makes my plans obsolete. The last one was the surge in unstructured data, especially e-mail and instant messaging. And we're about to hit another break in the pattern.
For years, I've worked on projects that used video image data for business purposes: analyzing movement patterns to detect abnormal behaviors, monitoring public safety from subway system cameras, and using video from traffic cameras to identify vehicles by license plate, make and model.
All three followed the same basic idea: Take the video stream and run pattern analyses to identify specific and unexpected events. By the third example--traffic cameras--the pattern algorithms had improved significantly and lent themselves to far more parallel computing, and compute cycles were getting cheap enough to do at least some of the processing in near real time.
We estimated that if we looked at all the U.S. surveillance video available in 2005 (the year of the project), we would see about 80 percent of all driven vehicles at least once.
As impressive and potentially valuable as all this is, the amounts of data involved dwarf just about anything we have seen so far in corporate IT.
A single surveillance camera generates a lot of data--up to 20 Mbps for uncompressed color. Scene synthesis--where image streams from several cameras are combined to create a 3-D scene--generates even more as "hidden" parts of the scene are filled in from comparisons of the different viewpoints.
In static scene analysis--motion-detection triggering, for example, where not much happens between events of interest--only a few seconds of data need to be kept for each hour, or even each day, so the volumes have been manageable. In a growing number of applications, however, we are monitoring a continuously changing scene and have to process and store data continuously.
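To put rough numbers on that difference, here is a back-of-the-envelope sketch that takes the 20 Mbps figure cited above for a single camera at face value. The 60 seconds of triggered footage per hour is purely an illustrative assumption, not a measured value, and the function and variable names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_stored(bitrate_mbps: float, seconds_recorded: float) -> float:
    """Bytes produced by one camera recording for the given number of seconds."""
    bits_per_second = bitrate_mbps * 1_000_000  # megabits -> bits
    return bits_per_second / 8 * seconds_recorded  # bits -> bytes


# Continuous monitoring: the camera records every second of the day.
continuous_per_day = bytes_stored(20, SECONDS_PER_DAY)

# Motion-triggered recording. ASSUMPTION: 60 seconds of footage kept per hour.
triggered_per_day = bytes_stored(20, 60 * 24)

print(f"continuous: {continuous_per_day / 1e9:.1f} GB per camera per day")
print(f"triggered:  {triggered_per_day / 1e9:.1f} GB per camera per day")
```

Under these assumptions, one continuously recorded camera produces about 216 GB a day, against roughly 3.6 GB for the triggered case. Multiply by the number of cameras and the retention period to see how quickly the totals grow.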
Once you need this capability, you have to make some tough data management decisions, including the following:
Compressed or raw storage? Compression saves space, but most compression algorithms reduce fidelity. Will the results be sufficient for the required use? Should compression occur before or after an analysis is performed?
Indexing? Once data is stored, what kinds of queries should we expect to have to answer?
What kind of archive and retrieval media? Maybe magnetic tape--a low-cost alternative to optical media or disk drives--will make a comeback, because the data structure is inherently serial. And how long should this kind of data be kept?
Security? How can you ensure that a digital image store isn't tampered with? Since this is business data, it may one day be used as evidence.
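On the tampering question, one common approach (offered here as a sketch, not as a method used in any of the projects described) is to chain cryptographic hashes across stored video segments, so that altering, reordering or dropping any segment changes every digest that follows it. The segment contents and function names below are hypothetical. The example uses SHA-256 from Python's standard hashlib module, and it assumes the digests are kept somewhere the video store's administrators cannot modify.

```python
import hashlib

# Fixed starting value for the chain (an arbitrary choice for this sketch).
GENESIS = b"\x00" * 32


def chain_digests(segments: list[bytes]) -> list[bytes]:
    """Return one digest per segment, each covering the previous digest too."""
    digests = []
    previous = GENESIS
    for segment in segments:
        previous = hashlib.sha256(previous + segment).digest()
        digests.append(previous)
    return digests


def verify_chain(segments: list[bytes], trusted_digests: list[bytes]) -> bool:
    """Recompute the chain and compare it with the separately stored digests."""
    return chain_digests(segments) == trusted_digests


# Hypothetical stand-ins for recorded video segments.
video = [b"segment-0 frames", b"segment-1 frames", b"segment-2 frames"]
trusted = chain_digests(video)

tampered = list(video)
tampered[1] = b"segment-1 frames (edited)"

print(verify_chain(video, trusted))     # untouched store
print(verify_chain(tampered, trusted))  # one segment altered
```

A chain like this only makes tampering detectable; it does not prevent it, and it is only as trustworthy as the place the digests are kept. Whether that would satisfy evidentiary requirements is a separate question.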
This is the discontinuity that's coming, and it's difficult to get ahead of the problem. There's a lot of excitement about the potential value of video for business applications, and cameras continue to proliferate. Unfortunately, there isn't nearly enough focus on preparing for the consequences of storing and managing all that unstructured data.
We need to pay attention--now.