What kind of problems can probability models solve?
Probability models have three basic building blocks. One is timing: how long until something happens. One is counting: how many arrivals, how many purchases or whatever we will see over a given period of time. And one is choice: given an opportunity to do something, how many people will choose to do it. That's it. Most real-world business problems are just some combination of those building blocks jammed together. For instance, if you're modeling the total time someone spends at a Web site during a given month, you might model it as counting-timing: a count model for the number of visits and a timing model for the duration of each one. My view is that we can very easily build simple models in Excel for each of those three things. A lot of people have built these kinds of models over the years and have tested them very carefully, in some cases putting them directly up against data mining procedures. They have found that the models' capabilities are not only astonishing, but far better than those of data mining. If you think about all the different ways you can combine timing, counting and choice, you can tell all kinds of interesting stories about different business situations.
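As a rough illustration of the counting-timing combination described above, the sketch below simulates the total time one visitor spends on a site in a month. It assumes a Poisson count model for the number of visits and an exponential timing model for the length of each visit. Both distributions, the function names and the numbers (4 visits a month, 6 minutes a visit) are illustrative assumptions, not taken from the interview.

```python
import random


def simulate_monthly_minutes(visit_rate, mean_visit_minutes, rng):
    """Simulate one month of site time for one visitor (hypothetical helper).

    visit_rate: assumed average number of visits per month (Poisson rate).
    mean_visit_minutes: assumed average length of one visit.
    """
    # Count model: a Poisson number of visits, drawn by counting how many
    # exponential inter-arrival gaps fit inside one month (one unit of time).
    visits = 0
    elapsed = rng.expovariate(visit_rate)
    while elapsed < 1.0:
        visits += 1
        elapsed += rng.expovariate(visit_rate)

    # Timing model: each visit lasts an exponentially distributed time.
    return sum(rng.expovariate(1.0 / mean_visit_minutes) for _ in range(visits))


def expected_monthly_minutes(visit_rate, mean_visit_minutes):
    """Mean of the combined model: expected visits times expected duration."""
    return visit_rate * mean_visit_minutes


if __name__ == "__main__":
    rng = random.Random(42)
    sample = [simulate_monthly_minutes(4.0, 6.0, rng) for _ in range(10000)]
    print("simulated average:", sum(sample) / len(sample))
    print("model expectation:", expected_monthly_minutes(4.0, 6.0))
```

The simulated average should land close to the model expectation, which is the point of the combination: two simple pieces, one for counting and one for timing, give a usable model of total time.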
How would you use these models to identify the most profitable customers or calculate customer lifetime value?
This is where probability models can come together beautifully with data mining. We can use these models to come up with very accurate forecasts about how long this customer will stay with us or how many purchases they'll make over the next year. So use the probability model to capture the basic behavior, and then bring in data mining to understand why groups of customers with different behavioral tendencies are different from each other. You see, behavior itself is not perfectly indicative of the true underlying propensities, which is what managers really want to know. And so we build a probability model that helps us uncover the propensities, and then we can take those propensities (the customer's tendency to do something quickly or slowly, or to stay online a long time or not) and throw them into the data mining engine to explain them as a function of the 600 variables. You'll find a much more satisfying and fruitful explanation in terms of being able to profile new customers and understand the likely actions of current ones. When it comes to taking the outputs of the probability model and understanding them, data mining procedures are the best way to go.
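One way to see the gap between observed behavior and underlying propensity is a gamma-Poisson (negative binomial) count model, a common choice that the interview itself does not name. The sketch below assumes each customer buys at a Poisson rate and that rates vary across customers according to a gamma distribution. The parameter values and function name are hypothetical.

```python
def posterior_purchase_rate(purchases, years_observed, shape, rate):
    """Estimated purchase propensity (purchases per year) for one customer.

    Assumes a gamma(shape, rate) spread of Poisson purchase rates across
    customers. The estimate is the posterior mean of the customer's rate,
    which pulls the raw observed rate toward the population average.
    """
    return (shape + purchases) / (rate + years_observed)


# Hypothetical population: average propensity of shape / rate = 2 per year.
SHAPE, RATE = 2.0, 1.0

# Two customers observed for one year: one bought nothing, one bought 10 times.
quiet = posterior_purchase_rate(0, 1.0, SHAPE, RATE)
busy = posterior_purchase_rate(10, 1.0, SHAPE, RATE)

print(quiet)  # 1.0, although the observed rate is 0 per year
print(busy)   # 6.0, although the observed rate is 10 per year
```

Under these assumptions, the estimated propensities, rather than the raw counts, are what would be handed to a data mining procedure to be explained by customer variables.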
Can probability models capture longitudinal or predictive information?
Very, very well. In fact, one of my favorite examples is looking at customer retention and return. You can do it simply without any explanatory variables at all. The irony is that if you bring in explanatory variables, in many cases the model will do worse. This makes managers crazy. They need to know why these people are different. But if you're bringing in explanatory variables that aren't really capturing the true underlying reasons for the differences, then you're just adding noise to the system. Your ability to come up with an accurate forecast for each group might actually be worse.
So you use data mining to help you figure out why those propensities exist.
That's right. The key is to explain the propensities (the tendency to do things) as opposed to the behavior itself.
You said these models can be built in a spreadsheet. It doesn't sound like you have to be a high-powered Ph.D. to create them.
Of course, that never hurts. But yes, these models are far more transparent to managers because the stories they tell are simpler, the demands on the data are far lighter, and the implementation is much easier. So what I like to do is start people out with some of the really simple models and get them hooked. Show me how many customers we've had in years one, two, three, four and five, and I'll tell you how many we'll have in years nine and ten, before we even bring in all the explanatory variables that data miners want to use.
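The forecasting task described above (five years of customer counts in, years nine and ten out) can be sketched without any explanatory variables. The example below uses a shifted-beta-geometric retention model fitted by a crude grid search. That model choice, the grid, the function names and the sample counts are all assumptions made for illustration; the interview does not say which model is meant.

```python
import math


def survival_curve(alpha, beta, periods):
    """Share of the starting cohort still active after 0..periods renewals."""
    curve = [1.0]
    for t in range(1, periods + 1):
        # Retention rate for period t under the shifted-beta-geometric model.
        retention = (beta + t - 1) / (alpha + beta + t - 1)
        curve.append(curve[-1] * retention)
    return curve


def log_likelihood(alpha, beta, active_counts):
    """Log-likelihood of yearly active counts (first entry is the full cohort)."""
    periods = len(active_counts) - 1
    s = survival_curve(alpha, beta, periods)
    total = 0.0
    for t in range(1, periods + 1):
        lost = active_counts[t - 1] - active_counts[t]
        total += lost * math.log(s[t - 1] - s[t])
    # Customers still active at the end are only known to have survived so far.
    total += active_counts[-1] * math.log(s[-1])
    return total


def fit_sbg(active_counts, grid_steps=100, step=0.1):
    """Grid-search maximum likelihood over alpha and beta in (0, 10]."""
    best = None
    for i in range(1, grid_steps + 1):
        for j in range(1, grid_steps + 1):
            alpha, beta = i * step, j * step
            ll = log_likelihood(alpha, beta, active_counts)
            if best is None or ll > best[0]:
                best = (ll, alpha, beta)
    return best[1], best[2]


def forecast_active(active_counts, year):
    """Projected number of active customers in a given year (year 1 = cohort)."""
    alpha, beta = fit_sbg(active_counts)
    return active_counts[0] * survival_curve(alpha, beta, year - 1)[year - 1]


if __name__ == "__main__":
    counts = [1000, 750, 600, 500, 429]  # hypothetical years one to five
    for year in (9, 10):
        print(year, round(forecast_active(counts, year)))
```

A grid search is used only to keep the sketch dependency-free; a spreadsheet solver or a numerical optimizer would do the same job.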
If companies move to using models more, what data can they stop collecting and what data will they still need to collect?
Ultimately, what matters most is behavior. That shouldn't be a controversial statement, but a tremendous amount of the data that's being collected is nonbehavioral. Data on demographics, psychographics, socioeconomics and even consumer attitudes can not only waste servers and storage space but can actually make the models perform worse. I have lots of examples of data that leads to tremendously misleading inferences about what really matters.
So behavior's what matters most, and even then you can often summarize behavior in very simple ways. For instance, in many cases we find that you don't even need to know exactly when each transaction occurred to make forecasts. Simply give me summary statistics, such as recency and frequency. Just tell me when they last made a purchase and how many purchases they made over the last year, and that will explain pretty much everything worth explaining. You mentioned that a CIO Insight survey found that the amount of customer data companies are collecting is increasing at an annual rate of about 50 percent. I would claim that most of that 50 percent is completely wasted. It's one thing to have 50 percent more data, but you're certainly not getting 50 percent more knowledge or insight. In fact, you could be doing more harm than good, because you're crowding out the few variables that really do matter.
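The two summaries mentioned above, the time since the last purchase and the number of purchases over the last year, are easy to compute from a raw transaction log. The sketch below is one minimal way to do it; the function name, the 365-day window and the sample dates are hypothetical.

```python
from datetime import date


def recency_frequency(purchase_dates, as_of, window_days=365):
    """Return (days since last purchase, purchases in the trailing window).

    Returns (None, 0) for a customer with no purchases on or before as_of.
    """
    past = [d for d in purchase_dates if d <= as_of]
    if not past:
        return None, 0
    recency_days = (as_of - max(past)).days
    frequency = sum(1 for d in past if (as_of - d).days < window_days)
    return recency_days, frequency


# Hypothetical customer with three purchases during 2023.
purchases = [date(2023, 1, 15), date(2023, 6, 1), date(2023, 11, 20)]
print(recency_frequency(purchases, as_of=date(2023, 12, 31)))  # (41, 3)
```

Once each customer is reduced to these two numbers, the individual transaction timestamps are no longer needed for this kind of summary-based forecasting.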
What companies have done a good job of using models this way?
I wish I could put some companies on a pedestal, but I've never seen a firm really embrace this stuff as fully as I'd like. And I'll tell you why: It's really my fault. It's the fault of academics who spend almost no time teaching these procedures. Most firms just aren't getting exposed to this stuff.
What should CIOs do to help their companies use analytical and modeling tools appropriately?
For one thing, remember that more is not necessarily better. CIOs often push back on analytics because of cost, but if someone could give them all this additional data for free, they'd take it. That's often wrong. Additional data can actually harm you because you're going to start capturing random, quirky, idiosyncratic things that aren't related to the true underlying propensities. The flip side is that a few simple measures that have been around forever, like recency and frequency, are all you need. If you can use data collection technology to get those measures more accurately or on a timelier basis, then maybe it's worth the investment. Second, remember that some surprisingly simple models can take you incredibly far if you're willing to not worry so much about drivers. Don't bother looking for the drivers; first, capture the behavior. So start simple; that often means start in Excel. You'd be amazed at how much you can accomplish without even having to leave the spreadsheet environment.