Predictive analytics – how much data do you really need?


When I’m talking to prospective clients, something I hear a lot is ‘but we don’t really have enough data to do any data mining’. It’s a common misconception that you need vast terabytes of data to do anything meaningful with analytics. In fact there are a number of similar misconceptions about data mining and predictive analytics that I want to tackle in this blog post.

Myth one: it’s only worth mining huge datasets

It’s certainly true that many data mining projects involve working with massive datasets, and these tend to be the ones that get the publicity. Hence it’s common for potential clients to assume that data mining and predictive analytics aren’t for them unless they have a Tesco Clubcard-sized database. It’s also true that many of the statistical techniques commonly used in predictive analytics projects were developed with very large datasets in mind.

However, neither of these facts means that organisations with much smaller datasets cannot get value from mining that data and using it for predictive analytics. On the contrary, some of the clients I have worked with have gained valuable, business-changing insights from comparatively small datasets. Even if you are planning to build a huge database, it can be useful to do some basic analytics on your data while it’s still fairly contained, size-wise. Such analysis can often reveal that you actually need to be collecting different data, or structuring your data collection differently, and gaining this knowledge early in the process can save you a lot of time and money down the line.

Myth two: your data mining will be more effective if you include all the data you possibly can

In fact you should only include a data item if it genuinely contributes to your understanding of the question you’re trying to answer. Throwing everything into the pot and hoping for the best can actually reduce the predictive power of your models, making your data mining less effective rather than more. This is particularly true if you include irrelevant data, or multiple measures of the same thing, in your dataset. For example, if you include variables for both age and date of birth (effectively two measures of the same thing), your modelling tool will find that they are equally predictive of whatever it is you’re trying to predict, and will split the predictive weight between them. Each then looks like a weaker predictor than it really is, which reduces the interpretability and usefulness of your model.
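The weight-splitting effect is easy to demonstrate. The sketch below is illustrative only, using made-up data and NumPy’s least-squares solver: it fits the same outcome once with a single predictor and once with that predictor included twice (standing in for age and date of birth), and the solver spreads the weight across the two copies.

```python
import numpy as np

# Illustrative data: an outcome driven by a single underlying variable
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 3.0 * x + rng.normal(scale=0.1, size=500)

# Fit with the predictor included once
coef_once, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

# Fit with the same predictor included twice
# (like holding both age and date of birth)
coef_twice, *_ = np.linalg.lstsq(np.column_stack([x, x]), y, rcond=None)

print(coef_once)   # ~[3.0]      -- full weight on the single predictor
print(coef_twice)  # ~[1.5, 1.5] -- weight split between the duplicates
```

Each duplicated column ends up carrying only half the true effect, so on its own each looks like a much weaker predictor than the variable actually is.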

Myth three: you shouldn’t build models based on a sample of your data

It’s often argued that if you build a model based on only a sample of your data, you lose all the information in the data you leave out, reducing the model’s power. However, if you’re analysing your existing customer data to help with the acquisition of new customers, then your analysis is already based on a sample (existing customers) of the whole population (potential customers).

There may also be other situations in which you have no choice but to conduct your analysis based on a sample of all the data that you hold. For example, you may find that you hold data that is not relevant to the particular problem you’re working on. It’s common for data warehouses to include substantial volumes of historical data reflecting conditions that no longer apply, so it wouldn’t be helpful to include that data in any model that you hope to use for future decision-making. 

You may also find that it simply isn’t practical to indulge in full-scale data collection. For example, perhaps you want to find out how satisfied your customers are with your services but administering each survey takes a long time. It makes sense to limit your analysis to a sample, as the cost of collecting the additional data will outweigh any potential benefit it might give you.

Myth four: sampling is all about size

Effective sampling is about maximising the amount of information you gain from each unit of effort you expend. A small probability sample, as long as it is genuinely random and not biased in any way, can have very high predictive power. Take political opinion polls as an example. There are around 46 million registered voters in the UK, but opinion polls are usually based on a sample of around 1,000 of them. Assuming that sample is truly random, the results are usually accurate to within about three percentage points. Polling a larger sample would add to the cost without improving the accuracy enough to make that extra cost worthwhile. When opinion polls get it badly wrong, it’s generally because the sample is biased, not because it’s too small.

A famous case of this was the Literary Digest’s straw poll during the 1936 US presidential election. The Literary Digest was a general-interest weekly magazine that, based on straw polls, had correctly predicted the result of the previous four presidential elections. In 1936 its poll predicted a landslide win for Governor Alf Landon of Kansas. Of course, in the event the 1936 presidential election was won by Franklin D Roosevelt. The Literary Digest was completely discredited and ceased publishing soon after.

So what went wrong? The prediction was based on a poll sent to around 10 million voters, of whom about 2.4 million replied – a huge sample – so the magazine was extremely confident that its prediction was accurate. However, the composition of a sample matters far more than its size. The magazine drew its sample from three sources – its own readership, a register of car owners, and a list of telephone subscribers – and therein lay the problem. In 1936 all three groups were disproportionately wealthy compared to the average voter, and therefore more likely to vote Republican, so the poll massively overestimated the Republican candidate’s chances. In contrast, George Gallup correctly predicted the result of the same election using a sample of only around 50,000 voters, showing that how a sample is selected matters far more than how big it is when it comes to ensuring accurate results.


So the reality is that building effective predictive models does not necessarily require very large data volumes. In fact, often less is more. Of course, as we have talked about before on this blog, what really determines the success or otherwise of a predictive analytics project is deployment, and that’s something I will talk about in more detail in my next post.