Managing Big Data: What Every CIO Needs to Know
By Paul S. Barth, PhD
Advances in information technology over the past five decades have been nothing short of breathtaking. Consider: We are now able to place sophisticated processors, enormous memory, and high-bandwidth networks into inexpensive, yet powerful, consumer products--effectively connecting billions of people and products around the world.
While this offers tremendous opportunities, it also creates difficulties: computing on such a vast scale generates data faster than it can be managed. That is why, although storage costs less every year, many large companies are seeing their total storage costs rise. One large financial-services company, in fact, saw its data stores grow from four to 40 petabytes in just the last two years.
Welcome to the "Big Data" era. In many ways, big data is a new frontier--a continuously connected marketplace of consumers and companies, from which communications and activity can be mined to deliver personalized, relevant offers and messages, all executed with unprecedented speed, automation, and intelligence. The opportunities are vast.
Experienced CIOs see this opportunity in context. They know that using big data to deliver real business results will require a focused strategy that leverages and protects their existing data assets, develops new capabilities that are production-ready and reusable, and manages the deluge of new data that will be created in the process.
Our work with Fortune 100 CIOs has revealed several key ideas for dealing with both the opportunity and the challenge of big data.
Data, Information, or Insight?
For many companies, the recent explosion in data is not a result of increased business transactions or better use of information and analytics. Rather, it's the result of unmanaged replication. Large email attachments that are broadly distributed, hundreds of extracts from production systems sent nightly to departmental databases, and unclear archive-and-purge processes drive data growth without creating any new information. The value of big data comes from new information and insights, not copies of existing data, and there are three main ways in which to get started down the right path.
First, separate the signal from the noise. Begin reducing the noise by locking down and simplifying the data environment with information lifecycle management (ILM), data governance, and master data management. This does not mean waiting to get started on big data; rather, plan to retire two copies of legacy data for every new data source created.
Second, it is critical to identify (even broadly) what new information and insights big data can provide and how they will affect the business. We've done a number of case studies that illustrate this in action. Actions you can take include:
Voice of the Customer: Summarize call-center and customer email correspondence nightly with text-mining tools to prioritize top product and service issues and desired features.
Accelerate Analytic Processes: Create a multi-terabyte "analysis-ready" database to support common analytic needs, such as customer marketing segmentation. One company accelerated its go-to-market processes by an order of magnitude with this technique.
Business Event Detection: Design channels to identify important business events during interactions with customers and automate responses. At a large insurer, for example, timely, targeted responses to customer behavior based on such identification improved close rates 20 percent and increased retention 10 percent.
Third, define the smallest possible scope for success. Be rigorous in defining the new information that is needed, and then decide if big data is the only source. If it is, then assess the smallest set of data required to generate that information. Ask questions such as: How much history is needed for trend analysis? How granular must the data be?
For discovery and analysis projects, statistical samples can produce the same insights as full-volume historical data sets. Most large companies try to understand coarse, consistent patterns in customer behavior and product performance so they can optimize their business processes and products at scale. An analysis of 500,000 random car buyers will yield just about the same insights as 50,000,000. Unless your business can take advantage of micro-segmentation and harvesting "the long tail," a rigorous sampling and analysis process will yield sufficient actionable insights.
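To see why a sample can stand in for the full data set, consider estimating a simple proportion, such as the share of buyers who choose a given option. The sketch below is a minimal illustration, assuming simple random sampling and a hypothetical true rate of 30 percent; the function name and the rate are illustrative assumptions, not figures from the case studies. It computes the approximate 95 percent margin of error at both sample sizes.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p estimated
    from a simple random sample of size n (normal approximation)."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Hypothetical share of buyers with some attribute (an assumption).
assumed_rate = 0.30

for sample_size in (500_000, 50_000_000):
    moe = margin_of_error(assumed_rate, sample_size)
    print(f"n={sample_size:>11,}: +/- {moe * 100:.3f} percentage points")
```

Under these assumptions, the 500,000-buyer sample is already accurate to roughly a tenth of a percentage point, and a hundredfold increase in data narrows the error only tenfold. The picture changes for rare segments, where the effective sample in each segment is much smaller.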
Leverage Big Data Technology
Networked, dynamic business processes built at a very granular level can produce billions, even trillions, of bytes of data each month. The demands of big data have traditionally outstripped improvements in technology cost/performance. Fortunately, new architectures and approaches have evolved over the last decade that simplify managing these enormous data volumes, and they are finally being incorporated into the enterprise architectures of many large companies. These include:
Database Appliances and Accelerators: Relational database technology has evolved dramatically over the last decade, allowing terabytes, even petabytes, of data to be loaded and queried quickly and efficiently on a single platform. Database appliances bundle storage, processing, interconnects, and query processing onto a dedicated hardware and software platform optimized for database performance and management. Database accelerators use innovative storage and query optimizations to reduce database size and accelerate complex query performance. Where hardware upgrades on traditional relational databases might improve performance by a factor of two, appliances and accelerators can improve price-performance by a factor of 100. Most important, these technologies simplify management and administration by eliminating the need for expert tuning and configuration.
NOSQL Data Stores: A technology literally born from the Internet, Not-Only-SQL technology was designed from the start to manage enormous, distributed data sets that can be queried in milliseconds. Instead of normalizing data into relational tables that are then joined for answers, very large data sets are distributed across hundreds or thousands of processors, organized so that related data is stored together. Queries run in parallel across all processors, each returning answers based on its local data. This incredibly simple and scalable approach is very efficient and flexible, allowing for a wide variety of data types to be stored together, as well as sophisticated queries to be run.
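The partition-and-query-in-parallel idea can be sketched in a few lines. The example below is a toy, in-memory illustration rather than any particular product's API: the partition count, the routing function, and the customer records are all hypothetical. Records are routed by customer ID so that related data is stored together, and a query is sent to every partition at once, with each partition answering from its local data.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # stand-in for hundreds or thousands of processors


def partition_for(customer_id: str) -> int:
    """Route a customer to a partition so that customer's data stays together."""
    # A stable byte sum is used because Python's built-in str hash
    # varies between processes.
    return sum(customer_id.encode("utf-8")) % NUM_PARTITIONS


def build_partitions(records):
    """Distribute (customer_id, amount) records across the partitions."""
    partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]
    for customer_id, amount in records:
        partitions[partition_for(customer_id)][customer_id].append(amount)
    return partitions


def local_total(partition) -> float:
    """Each partition answers using only the data it holds."""
    return sum(sum(amounts) for amounts in partition.values())


def total_spend(partitions) -> float:
    """Send the query to every partition in parallel, then combine the answers."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return sum(pool.map(local_total, partitions))


# Hypothetical sample data.
records = [("alice", 120.0), ("bob", 35.5), ("alice", 60.0), ("carol", 14.5)]
partitions = build_partitions(records)
print(total_spend(partitions))
```

Because each partition works independently, adding partitions spreads both the storage and the query work, which is the source of the scalability described above. A production data store would also handle replication, node failure, and rebalancing, none of which this sketch attempts.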
Automated Analytics: Harvesting insights from big data requires analytics, and, in most companies, this is the domain of a small number of highly trained specialists. Capturing, cleansing, and combing through terabytes of data is often more art than science, and most analysts will tell you that their manual processes cannot be automated. However, over the last decade, advances in self-learning algorithms, genetic algorithms, and automated testing have produced programs that discover patterns, generate insights, and improve over time--in other words, they learn. These systems might not always outperform their human counterparts, but their automated processes might be the only way to scale to the demands of big data.
An important role for big-data technology in the enterprise is to reduce data volume and complexity. For example, most large companies have good processes for managing an important business event--such as responding to potential fraud or calling for service. Big-data technology allows the scanning of billions of transactions to identify or anticipate an important business event; once identified, traditional technology and processes can be launched in response. In this way, big-data technology acts as a transducer that translates the cacophony of data into useful, manageable information.
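As a rough illustration of this transducer role, the sketch below streams through transactions, keeps only the few that match a rule, and hands those to an existing response process. The threshold, the record fields, and the response function are hypothetical stand-ins; a real fraud or service rule would be far richer.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Transaction:
    account_id: str
    amount: float


# Hypothetical rule: flag any single transaction above this amount.
REVIEW_THRESHOLD = 10_000.0


def detect_events(transactions: Iterable[Transaction]) -> Iterator[Transaction]:
    """Scan the stream and yield only the transactions that need follow-up."""
    for txn in transactions:
        if txn.amount > REVIEW_THRESHOLD:
            yield txn


def open_review_case(txn: Transaction) -> str:
    """Stand-in for the existing, traditional response process."""
    return f"review opened for account {txn.account_id}"


# Hypothetical sample data.
stream = [
    Transaction("A-1", 42.0),
    Transaction("B-7", 25_000.0),
    Transaction("C-3", 980.0),
]
cases = [open_review_case(txn) for txn in detect_events(stream)]
print(cases)
```

The detection step reads everything but passes along very little, so the downstream process sees a small, manageable set of events regardless of how large the input stream grows.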
Big data must be considered in the context of the enterprise data and analytics environment, which we think of as an ecosystem: capturing and creating data, cleansing and organizing it, mining business insights from it, and using those insights to drive intelligent actions in the business. By feeding data that measure the outcomes of these actions back into the system, a closed loop is created that allows companies to use their data to test, learn, and improve their processes.
The diagram below depicts three broad domains of the ecosystem: data, insight, and action.
Data capabilities are responsible for creating and managing usable, high-quality enterprise information assets. These include all standard data-management capabilities, such as data sourcing and integration; quality and metadata management; data modeling; and data governance. Insight capabilities include tools, data, and processes for management reporting and advanced analytics. Action capabilities provision data and business intelligence to applications, business processes, and business partners, and capture responses to interactions.
Big data presents opportunities and challenges in each of these domains. Data management leverages big-data technology to eliminate redundancy and provide scalable infrastructure for managing big-data assets. Insight uses appliances and accelerators, NOSQL technology, and automated analytics to expose new value hidden in big data. Businesses deploy these insights through intelligent agents mediating both internal and external communications and interactions.
Big data presents fascinating opportunities for insight and innovation--as well as the challenge of separating the signal from the noise. Increasingly, companies are overlaying their internal, proprietary data with insights from external structured and unstructured data to better understand their customers, performance, and marketplace. New technologies are making big data useful and manageable, but careful, business-driven planning and governance are essential to success. Starting from clear business objectives, enterprises are evolving to manage the dramatic growth in data, harvest new insights, and continuously optimize their actions.
About the Author
Paul Barth, PhD, Managing Partner/Founder, NewVantage Partners.