When I first started running programs that dealt with big data--meaning both a lot of data about something or somebody and a lot of things or people to have some data about--big was actually pretty small.
I once built a system for a modern 300-bed hospital that ran everything (including patient records for half a million people) on less than 10GB (yes, you read that right) of high-performance disk storage.
It's interesting to note that the performance of today's far larger storage arrays isn't intrinsically much better than what I was getting in 1980--maybe twice as fast for data retrieval. There's just a lot more data being stored, and the cost per stored bit is way down. Some of the same operational challenges are still around, too.
- First, data quality remains an issue. The more data you accumulate, the harder it is to keep everything consistent and correct. We have invented whole new areas of focus (master data management) and tools to deal with the garbage in/garbage out problem, but it's not getting any easier. With really large data sets accumulated over time (which means that things change--what was once correct isn't any more, and vice versa), you have to solve for garbage in/gold out and prevent gold in/garbage out.
- Second, adequate data characterization (metadata to the geeks) is critical. How you deal with data -- even how you choose to organize its storage -- requires you to know how much data there is going to be and how fast it's likely to grow and change. A query that runs well to find 100 rows in a million-row table may not run well on 100 billion rows. It matters how you flag and track errors. Logging and auditing matter if the data changes frequently--less so if the data is essentially static.
- Third, interpretation remains more of an art than a science -- or a science accessible to only a few trained specialists. Software developers have had to design efficient filters and pattern recognizers that can sift through mountains of data and find (perhaps unanticipated) patterns that are relevant to a dimension of interest.
- Fourth, data visualization -- representing results in an easily consumable form -- is critical. What good is all that data if you can't understand what the interpreters--human or software--concluded from their analysis? Data visualization design theory isn't new but, like many things that involve deep understanding of the range and vagaries of human cognition, it's hard to do well.
- Fifth, you're generally going to have to choose between a real-time view of the data (which may mean that you have to continuously recompute everything whenever the data changes) and a complete but retrospective view (the most common state of cube-based analytics), which will always be somewhat out of date.
- Sixth, how do you know in advance how long the data is relevant or valuable? Data costs money to acquire, store, analyze and back up. A retention policy beyond a typical "keep everything forever" approach is needed, and that policy has to be enforced.
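To make the sixth point concrete, here is a minimal sketch of what an enforceable retention policy might look like in Python. Everything in it is an assumption made for illustration: the category names, the retention periods, and the `Record`, `is_expired` and `enforce_retention` names are all hypothetical, and a real policy would be driven by legal, regulatory and business requirements rather than a hard-coded table.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods per data category (illustrative values only).
RETENTION = {
    "audit_log": timedelta(days=7 * 365),
    "clickstream": timedelta(days=90),
}


@dataclass(frozen=True)
class Record:
    record_id: int
    category: str
    created: date


def is_expired(record: Record, today: date) -> bool:
    """Return True if the record has outlived its category's retention period."""
    period = RETENTION.get(record.category)
    if period is None:
        # No policy defined for this category: keep the record rather than guess.
        return False
    return today - record.created > period


def enforce_retention(records: list[Record], today: date) -> tuple[list[Record], list[Record]]:
    """Split records into (kept, purged) according to the RETENTION table."""
    kept: list[Record] = []
    purged: list[Record] = []
    for record in records:
        (purged if is_expired(record, today) else kept).append(record)
    return kept, purged


if __name__ == "__main__":
    sample = [
        Record(1, "clickstream", date(2023, 6, 1)),
        Record(2, "audit_log", date(2020, 1, 1)),
    ]
    kept, purged = enforce_retention(sample, today=date(2024, 1, 1))
    print("kept:", [r.record_id for r in kept])
    print("purged:", [r.record_id for r in purged])
```

The sketch keeps anything without a defined policy, which is the cautious default. A stricter shop might instead flag uncategorized data for review so that "keep everything forever" doesn't creep back in by omission.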
It's probably best to start from the value end of the equation and keep only what you are sure you will need. After all, someone else is probably keeping everything else for you already.
About the Author
John Parkinson is head of the Global Program Management Office at AXIS Capital. He has been a technology executive, strategist, consultant and author for 25 years. Send your comments to firstname.lastname@example.org.