The Battle to Tame Unstructured Data

It happens all too often: one of the 480 attorneys at the Los Angeles-based law firm Sheppard, Mullin, Richter & Hampton LLP pulls aside CIO Donna Paulson and asks her why it’s not possible to do a Google-like search on the firm’s vast repositories of briefs, memos, boilerplate legal documents and e-mails. “I have to tell them,” says Paulson, “it’s not that easy. There is no magic system that can search all our databases.”

Paulson and her colleague Tom Baldwin, the firm’s chief knowledge officer, are well aware that they are trying to answer some of the most vexing questions facing corporate America and CIOs today: Is this growing heap of unstructured data worth anything to anyone? And if so, then how do we make it accessible and useful?

E-mails are just one tiny sliver of an ever-expanding universe of unstructured information that includes Word documents, instant messages, blogs, PDFs and videos—all data that falls outside the traditional database, with its orderly, defined and structured tables that are easily searched, mined and manipulated. Analyst firms routinely report that structured data is merely the tip of the iceberg: The rest of the iceberg, roughly 80 percent of all corporate data, is unstructured.

If only companies could, with the magic click of a mouse, masterfully organize and access this unstructured data. Then, perhaps, they would find golden nuggets of sales and customer information, or the name of that potential business partner in India, or maybe even the secret trick to fixing the copier when it jams. Companies’ most valuable in-house intellectual know-how—memos about the best way to market, insights from the founder—would be there for the asking. Productivity would explode, customer service would shine, and sales would skyrocket. That’s the dream, anyway.

Unfortunately, making sense of this unstructured data, as Paulson knows, is monumentally difficult. And making it useful is even trickier. Something as basic as accurately searching archived e-mails is still beyond many standard e-mail software programs. Searching for, say, patterns of customer complaints in traditional databases can also be problematic. And even if a keyword search is available, the technology often fails to understand the concepts behind the words, so that much of the information returned is irrelevant.

“Unstructured data is growing by leaps and bounds, but just because we have a glut of data doesn’t mean every single piece is extremely important,” says Shaku Atre, principal of Atre Group Inc., a Santa Cruz, Calif.-based database and business-intelligence consulting company. “We need to identify what is important and what isn’t.”

Many companies feel that if they could just “Googlize” their internal data, the problem would be solved. And Google Inc. is all too happy to help (see Enterprise Search: Dave Girouard on Taking Google to the Corporation). But the needs of many companies go beyond simple search. Companies want to make the information actionable, see patterns in seemingly unrelated, large data pools and analyze information as a whole.

There are some complex applications for tapping into the value of unstructured data in specialized fields such as patent mining (running sophisticated algorithms to uncover competing patents), legal discovery (finding patterns of information in stacks of legal documents), and government intelligence (analyzing millions of phone calls and e-mails to identify potential terrorists).

And an auxiliary wave of more general applications is designed to reduce costs by capturing and accessing company knowledge, improving employee productivity, and effectively managing compliance issues.

But while there have been well-documented returns on investment for such work, the technology to tap into the value of unstructured data is still in its infancy, as is most of the data itself. And there is not a single, enterprisewide software package to address the problem. According to Robert Blumberg, managing director at Soquel Group and former president of Fresher Information Corp. (now Matisse Software Inc.), a company that specializes in unstructured-data management, “This is just the beginning of the technology—and you can tell because the applications are very expensive.” As such, any attempts to manage this data should have clear goals and obvious benefits. And that, ironically, is going to require some solid research. Great, more data.

Story Guide:

  • The Battle to Tame Unstructured Data
  • The Search For Meaning
  • Search Engines for Search Engines
  • The Ethics of Data
  • The Enterprise Approach?
  • Sidebar: The Data Revolution Will Be Televised

