By Michael Vizard
Text analytics has long been an inexact science. Algorithms that model the relationships between words so that a computer can derive a document's meaning have existed for years, but putting the data into a format that a computer can process and that end users can query has been both time-consuming and problematic.
With advances in content analytics and natural language processing, it's becoming easier not only to import documents directly as XML files that can be automatically tagged and stored in a database, but also for end users to ask questions about the relationships between those documents using natural language queries rather than traditional query languages such as SQL.
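To make the tagging step concrete, here is a minimal sketch of turning a raw document into tagged XML. The keyword vocabulary and tag categories are hypothetical; a production content analytics engine would use a trained entity extractor rather than a lookup table.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary mapping keywords to tag categories;
# a real system would rely on NLP-based entity extraction instead.
VOCAB = {
    "diabetes": "condition",
    "tobacco": "substance",
    "insulin": "medication",
}

def tag_document(doc_id, text):
    """Wrap a raw document in XML, tagging any known keywords it contains."""
    root = ET.Element("document", id=doc_id)
    ET.SubElement(root, "text").text = text
    tags = ET.SubElement(root, "tags")
    for word in text.lower().split():
        token = word.strip(".,;")          # crude tokenization for illustration
        if token in VOCAB:
            ET.SubElement(tags, "tag", category=VOCAB[token]).text = token
    return ET.tostring(root, encoding="unicode")

xml_doc = tag_document("note-1", "Patient has diabetes and uses tobacco.")
```

Once documents are stored in this form, a query layer can resolve questions against the tags rather than the free text.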
A case in point is BJC HealthCare, a health care organization that has 13 hospitals affiliated with Washington University in St. Louis. According to Tom Holdener, director of information systems for BJC HealthCare, medical students previously needed to read every document in order to categorize its contents. The document was then stored in a database, but it could only be queried using tools that required assistance from the IT department.
“As you can imagine,” says Holdener, “that was a time-consuming process.”
A new system based on an IBM DB2 database, which BJC HealthCare has been developing with IBM since 2012, now allows medical researchers to quickly determine, for example, how many patients with a certain disease might also be users of illegal drugs or tobacco products that would disqualify them from participating in a clinical trial.
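The kind of screening query described above can be sketched as follows. The record fields, tag values, and `eligible` helper are illustrative assumptions, not BJC HealthCare's actual schema.

```python
# Hypothetical patient records derived from tagged documents.
records = [
    {"patient": "A", "conditions": {"asthma"}, "substances": set()},
    {"patient": "B", "conditions": {"asthma"}, "substances": {"tobacco"}},
    {"patient": "C", "conditions": {"copd"},   "substances": set()},
]

def eligible(records, condition, excluded_substances):
    """Patients who have the target condition and none of the disqualifying substances."""
    return [
        r["patient"]
        for r in records
        if condition in r["conditions"]
        and not (r["substances"] & excluded_substances)   # set intersection
    ]

print(eligible(records, "asthma", {"tobacco", "illegal_drugs"}))  # prints ['A']
```

The point of the sketch is that once documents are tagged, trial screening becomes a simple filter over structured fields instead of a months-long manual review.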
“It used to take six to nine months just to work through all the data,” says Holdener.
Other use cases include making it easier to identify trends that lead to readmissions, which are now being more closely scrutinized as a way to contain health-care costs under the latest government mandates.
Not only are the relationships between keywords, phrases and ideas across different documents now more easily discoverable, but medical students who previously had to painstakingly categorize those documents are freed up to spend more time with patients, Holdener says.
While text analytics has improved greatly due to a content analytics engine based on the IBM Unstructured Information Management Architecture and more tightly integrated enterprise search capabilities, it’s still not a precise science. But the fact that text analytics technologies can automate 80 to 90 percent of the process means the amount of time needed to conduct clinical research can be significantly reduced.
In fact, IBM contends that the health-care industry is at the beginning of a journey in which natural language technologies and text analytics will be extended into a new world of cognitive computing. For example, as part of an IBM Smarter Care Solutions initiative, IBM is working with WellPoint on applying the text analytics and natural language processing capabilities that lie at the heart of the Watson supercomputer to cancer research so medical treatments are more consistently applied.
Many IT organizations rely on text analytics to mine 80 percent of their unstructured data. While the ROI on those text analytics investments has varied widely across different industries, the combination of natural language processing and more advanced text analytics engines means the technology can now be applied more cost-effectively.
“Natural language has been around for decades, but it wasn’t ready to be commonly used,” says Judith Hurwitz, president of the IT consulting firm Hurwitz & Associates. “But a lot of progress is now being made.”