The Battle to Tame Unstructured Data

By Eric Pfeiffer  |  Posted 05-30-2006

It happens all too often: one of the 480 attorneys at the Los Angeles-based law firm Sheppard, Mullin, Richter & Hampton LLP pulls aside CIO Donna Paulson and asks her why it's not possible to do a Google-like search on the firm's vast repositories of briefs, memos, boilerplate legal documents and e-mails. "I have to tell them," says Paulson, "it's not that easy. There is no magic system that can search all our databases."

Paulson and her colleague Tom Baldwin, the firm's chief knowledge officer, are well aware that they are trying to answer some of the most vexing questions facing corporate America and CIOs today: Is this growing heap of unstructured data worth anything to anyone? And if so, then how do we make it accessible and useful?

E-mails are just one tiny sliver of an ever-expanding universe of unstructured information that includes Word documents, instant messages, blogs, PDFs and videos—all data that falls outside the traditional database, with its orderly, defined and structured tables that are easily searched, mined and manipulated. Analyst firms routinely report that structured data is merely the tip of the iceberg: The rest of the iceberg, roughly 80 percent of all corporate data, is unstructured.

If only companies could, with the magic click of a mouse, masterfully organize and access this unstructured data. Then, perhaps, they would find golden nuggets of sales and customer information, or the name of that potential business partner in India, or maybe even the secret trick to fixing the copier when it jams. Companies' most valuable in-house intellectual know-how—memos about the best way to market, insights from the founder—would be there for the asking. Productivity would explode, customer service would shine, and sales would skyrocket. That's the dream, anyway.

Unfortunately, making sense of this unstructured data, as Paulson knows, is monumentally difficult. And making it useful is even trickier. Something as basic as accurately searching archived e-mails is still beyond many standard e-mail software programs. Searching for, say, patterns of customer complaints in traditional databases can also be problematic. And even if a keyword search is available, the technology often fails to understand the concepts behind the words, so that much of the information returned is irrelevant.

"Unstructured data is growing by leaps and bounds, but just because we have a glut of data doesn't mean every single piece is extremely important," says Shaku Atre, principal of Atre Group Inc., a Santa Cruz, Calif.-based database and business-intelligence consulting company. "We need to identify what is important and what isn't."

Many companies feel that if they could just "Googlize" their internal data, the problem would be solved. And Google Inc. is all too happy to help (see Enterprise Search: Dave Girouard on Taking Google to the Corporation). But the needs of many companies go beyond simple search. Companies want to make the information actionable, see patterns in seemingly unrelated, large data pools and analyze information as a whole.

There are some complex applications for tapping into the value of unstructured data in specialized fields such as patent mining (running sophisticated algorithms to uncover competing patents), legal discovery (finding patterns of information in stacks of legal documents), and government intelligence (analyzing millions of phone calls and e-mails to identify potential terrorists).

And an auxiliary wave of more general applications is designed to reduce costs by capturing and accessing company knowledge, improving employee productivity, and effectively managing compliance issues.

But while there have been well-documented returns on investment for such work, the technology to tap into the value of unstructured data is still in its infancy, as is most of the data itself. And there is not a single, enterprisewide software package to address the problem. According to Robert Blumberg, managing director at Soquel Group and former president of Fresher Information Corp. (now Matisse Software Inc.), a company that specializes in unstructured-data management, "This is just the beginning of the technology—and you can tell because the applications are very expensive." As such, any attempts to manage this data should have clear goals and obvious benefits. And that, ironically, is going to require some solid research. Great, more data.

Story Guide:

  • The Battle to Tame Unstructured Data
  • The Search For Meaning
  • Search Engines for Search Engines
  • The Ethics of Data
  • The Enterprise Approach?
  • Sidebar: The Data Revolution Will Be Televised

    The Search for Meaning
    It is not uncommon for the attorneys at Sheppard Mullin to generate up to 2,000 e-mails during a merger or acquisition deal. In years past, these messages, many of which covered important aspects of the transaction, were simply archived with tens of thousands of other e-mails. More than a few were deleted outright. When the time came to find a particular message, attorneys wasted enormous amounts of time searching for it. Quite often, they never found it. "It was a problem, and the partners said we needed to address it," Baldwin says.

    Many attempts have been made to put a dollar figure on the productivity loss associated with looking for old e-mails or tracking down important company memos. For instance, the Nielsen Norman Group, a consulting firm in Fremont, Calif., estimates that a company with 10,000 employees can save nearly $2.5 million by just improving search on its intranet. But these estimates vary widely and often are so arbitrary that "they just become noise people ignore," says Stouffer Egan, chief strategy officer in the U.S. for Autonomy Corp. plc, a Cambridge, U.K.-based software company that specializes in managing unstructured data. That said, lost productivity is a "big problem, and all organizations are aware of it," he adds.

    CIO Paulson and Baldwin looked at what they saw as the easiest, most cost-effective solution to their problem: desktop search. They looked at numerous free applications, but, says Baldwin, these technologies didn't offer simple features such as a preview of documents or highlighting of search terms.

    Instead, they chose Pasadena, Calif.-based X1 Technologies Inc.'s enterprise desktop search. The product quickly indexes data—in this case e-mails—on a computer and then displays a portion of the e-mail. With a click of the mouse, the user can call up the entire message. In 2005, the firm deployed the desktop application to 900 users. Baldwin says that the financial outlay, about $90,000, has paid for itself in increased satisfaction and productivity. Now, when a question arises during an M&A transaction, attorneys can do a quick search and see all the e-mails related to that specific deal. "Our lawyers have become addicted to it," Baldwin says.
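The index-then-preview behavior described above can be illustrated with a small sketch. This is not X1's implementation, which the article does not describe; it is a minimal inverted index written under that assumption, and the names `build_index`, `search` and `preview` are hypothetical. It shows the two features Baldwin found missing in free tools: a short preview of each hit and highlighting of the search terms.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(messages):
    """Map each word to the set of message ids that contain it."""
    index = defaultdict(set)
    for msg_id, body in messages.items():
        for word in tokenize(body):
            index[word].add(msg_id)
    return index


def search(index, query):
    """Return the ids of messages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def preview(body, query, width=40):
    """Return a short excerpt near the first hit, with query terms in [brackets]."""
    terms = sorted(set(tokenize(query)))
    if not terms:
        return body[:width]
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    match = pattern.search(body)
    if match is None:
        return body[:width]
    start = max(0, match.start() - width // 2)
    excerpt = body[start:start + width]
    return pattern.sub(lambda m: "[" + m.group(0) + "]", excerpt)


# Hypothetical sample messages, invented for illustration only.
messages = {
    1: "Please review the escrow terms for the Acme merger.",
    2: "Lunch on Friday?",
    3: "Acme escrow agent confirmed the wire.",
}
index = build_index(messages)
hits = search(index, "acme escrow")
```

Building the index once and intersecting word sets at query time is what makes this kind of search feel instant on a desktop; a production tool would also need to handle attachments, incremental updates and ranking, none of which this sketch attempts.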

    While effective, Sheppard Mullin's new desktop search does not let the attorneys search the firm's real trove of information. Each time an attorney creates a brief or plea in Microsoft Word, it is automatically deposited in the firm's content-management system, which, to date, has one million such documents. As with Microsoft Outlook, the standard search features that came with the firm's content-management system "weren't effective," says Baldwin.

    Worse, the problem compounded itself. Attorneys working on a brief would often send out all-company e-mails asking if someone else at the firm had written a similar brief they could draw upon, thus creating yet more unstructured data. The e-mails, not surprisingly, were routinely ignored. In a pilot project, slated to be completed this spring, Baldwin and his colleagues will finish integrating another X1 product with Sheppard Mullin's content-management system that they expect will provide accurate indexing and searching capabilities. When the project is completed, Sheppard Mullin attorneys hope that they will no longer have to "reinvent the wheel when they write a brief," says Baldwin. "We will be able to reuse some of our own intellectual property."

    Search Engines for Search Engines
    Sheppard Mullin's need to capture and reuse its intellectual capital is something Richard West can relate to. West is in charge of organizational and e-learning initiatives at BAE Systems plc, the massive London-based defense contractor. With more than 90,000 employees worldwide, the company has some predictable problems managing data. In 1999, BAE had more than 10 different search engines that trolled its vast universe of unstructured data, including everything from product specs to customer information and manufacturing processes. The situation was so out of control that "we needed a search engine to find the right search engine," jokes West. These engines used simple keyword searches that were not accurate, and employees would sometimes spend half an hour just sifting through all the search results. West's belief that 90 percent of any organization's knowledge is locked inside its workers' heads led to his conviction that there were millions to be saved by tapping into the company's brainpower and cutting down on duplicated work.

    To solve the problem, BAE Systems went with Autonomy's pattern-recognition software. The technology uses sophisticated algorithms to analyze vast amounts of unstructured data based on natural language queries, and then extracts not just keywords, but actual concepts or ideas behind those words. "It is a very advanced mathematical way of figuring out what 20 words mean in context," says Egan.

    The technology gives BAE engineers the ability to type in complex queries ("Who is currently working on new ways to use bonding rivets on airplanes?") and get the results they need across numerous databases and intranets. In one case, a group of engineers at BAE Military Aircraft Group were looking for a better way to bolt wings to an aircraft's fuselage, and after searching the company's stored knowledge, the engineers discovered that colleagues at another BAE affiliate had already solved the problem. They were quickly able to adopt the manufacturing process, which saved the company millions of dollars, says West.

    To date, Autonomy's technology is linked into most of BAE's key business systems, and over the past seven years it has contributed to estimated savings of £65 million (U.S. $120 million) by reducing duplicative efforts. "We have formed a virtual university, a place to go to find out information and know-how," says West.

    The Ethics of Data
    Another reason to deal with unstructured data is the ever-increasing burden of regulatory compliance. Aungate, a division of Autonomy, is specifically designed to help companies identify employees who may be breaking the law, or at least company policy. Its technology analyzes vast quantities of unstructured data such as e-mail and PowerPoint slides, and looks for exceptions: users who are doing something outside the norm.

    "Ninety-eight percent of information is irrelevant, but that other 2 percent could put you out of business," says Ian Black, managing director of Aungate. One of his company's clients, ABN AMRO Holding N.V. banking group, which has 3,000 branches in more than 60 countries, uses Aungate technology to analyze communications in real time. "They watch for internal messages and check them for compliance," says Egan.

    The technology first analyzes a group of e-mails that have been identified as problematic—say, investment bankers who are passing insider information to their equity counterparts. Through sophisticated pattern recognition, it then compares new e-mails to this original group and flags those that are suspicious.
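The compare-against-known-examples idea can be sketched in a few lines. The article does not say which algorithms Aungate uses, so the code below substitutes a deliberately simple stand-in technique: bag-of-words cosine similarity against a combined profile of the flagged e-mails. The function names, the sample messages and the 0.3 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profile(flagged_emails):
    """Sum the word counts of the known-problematic e-mails into one profile."""
    profile = Counter()
    for text in flagged_emails:
        profile.update(vectorize(text))
    return profile


def flag_suspicious(profile, new_emails, threshold=0.3):
    """Return the new e-mails whose similarity to the profile meets the threshold."""
    return [
        text for text in new_emails
        if cosine(profile, vectorize(text)) >= threshold
    ]


# Hypothetical examples, invented for illustration only.
flagged = [
    "Keep this quiet: the earnings numbers beat before the announcement",
    "Buy before the announcement, earnings will beat, keep it quiet",
]
incoming = [
    "Earnings will beat, buy before the announcement and keep quiet",
    "Team lunch moved to noon on Thursday",
]
profile = build_profile(flagged)
suspicious = flag_suspicious(profile, incoming)
```

Word counts alone would produce many false positives on real mail, which is presumably why commercial products lean on more sophisticated pattern recognition; the sketch only shows the shape of the workflow, in which a labeled seed set is turned into a profile and new messages are scored against it.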

    The technology is also being used to comply with requests from federal and state agencies. CIOs, after all, don't want to find themselves in the crosshairs of Uncle Sam. Some companies think that the best defense is a lack of evidence, and so will try to erase historical data at regular intervals. But experts warn that the opposite approach is actually more effective.

    Many unstructured-data experts well remember last year's $1.4 billion judgment against Morgan Stanley for acting illegally in the 1998 sale of Ron Perelman's Coleman Co. camping-gear company to Sunbeam Corp. Some claim the judgment was a direct result of the defense's inability to produce relevant e-mails and documents the court demanded. As one analyst noted, "During the pretrial discovery process, and the trial itself, Morgan Stanley kept stumbling on old, hard-to-search backup tapes and couldn't perform effective searches on its newer e-mail archive." Eventually, the judge became so frustrated with the delay that she ruled against Morgan Stanley.

    When the government comes looking for information, CIOs will be "forced to deal" with unstructured data, says Brian Babineau, a research analyst for the Enterprise Strategy Group in Palo Alto, Calif.

    The Enterprise Approach?
    Bill Pieroni, global CIO of Chicago-based insurance giant Aon Corp., smiles when he thinks of all the search products that have "enterprise" in their name. "The market has yet to produce an end-to-end enterprise solution," he says. "You can't just add water and stir."

    Pieroni is in the middle of an ambitious project to pull structured and unstructured data together from four different data repositories at Combined Insurance Co. of America, an Aon wholly owned subsidiary. The goal? To give the company's call center a more accurate and timely look at a customer's complete profile, including e-mails sent and received, voice messages, and historical documents. "When the operator answers, we need all the information presented," he says. By knowing a customer's complete history, including preferences and needs, Pieroni also wants to improve cross-selling.

    Aon's project, which will begin beta testing in the third quarter, has required "no less than several one-off solutions," Pieroni says, because no one system could handle and integrate images, voice mail and contracts. "Clearly the need outstrips the market's ability to serve." Which is why Pieroni is testing the technology on a subsidiary before rolling it out to the rest of Aon. "We needed to focus on a narrow set of data. It is a manageable sandbox."

    The technology designed to address the too-much-information syndrome is undeniably improving. As one expert said: Companies that ignore these new and powerful tools do so at their own peril. At the same time, however, the problem is growing faster than the solution, which means that a well-thought-out plan is crucial to choosing the data that will be most helpful in achieving a strategic business initiative.

    Eric Pfeiffer is a freelance business and technology writer based in San Francisco.