What do we mean when we talk about data modelling? An overview of different types of models

The real world, whether the physical world (machines, for example) or the natural world (human and animal behaviour, for example), is highly complex: many factors, some of them unknown, determine its behaviour and its responses to interventions. Even if every factor contributing to a phenomenon is known, it is unrealistic to expect that each factor’s unique contribution can be isolated and quantified. Mathematical models are therefore simplified representations of reality, but to be useful they must give realistic results and reveal meaningful insights. In his 1976 paper ‘Science and Statistics’ in the Journal of the […]

Predictive analytics – what can you do with your results?

I talked in my last blog post about the confusion that often emerges around how much data is enough to deploy predictive analytics effectively. I argued that sample selection matters far more than sample size when it comes to ensuring accurate results. As an example, I discussed two political polls from the 1936 US presidential election. The Literary Digest used a large (2.4 million) but heavily biased sample and got the prediction badly wrong. George Gallup, by comparison, got to within 1% of the actual election result using a much smaller sample (only 50,000) but that was much […]
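The sample-selection point can be made concrete with a small simulation. This is a minimal sketch with illustrative numbers, not the actual 1936 figures: we assume a hypothetical true support rate of 0.62 for the winning candidate, and a biased sampling frame (like the Digest’s telephone and car-owner lists) whose members support that candidate at only 0.45. A huge poll drawn from the biased frame lands confidently in the wrong place, while a far smaller random sample from the whole electorate gets close to the truth.

```python
import random

random.seed(42)

TRUE_SUPPORT = 0.62  # hypothetical true share for the winning candidate

def poll(n, support):
    """Simulate polling n voters whose true support rate is `support`."""
    return sum(random.random() < support for _ in range(n)) / n

# Large but biased sample: every respondent comes from a subgroup whose
# support is only 0.45 (an assumed figure standing in for the Digest's
# telephone/car-owner lists).
biased_estimate = poll(200_000, 0.45)

# Small but representative sample drawn from the whole electorate.
random_estimate = poll(2_000, TRUE_SUPPORT)

print(f"true support:      {TRUE_SUPPORT:.3f}")
print(f"biased, n=200,000: {biased_estimate:.3f}")
print(f"random, n=2,000:   {random_estimate:.3f}")
```

The biased poll’s sampling error is tiny, but it is precisely estimating the wrong population; the small random poll has more sampling noise yet stays within a point or two of the truth, which is the essence of the Digest/Gallup story.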

Predictive analytics – how much data do you really need?

When I’m talking to prospective clients, something I hear a lot is ‘but we don’t really have enough data to do any data mining’. It’s a common misconception that you need vast terabytes of data to be able to do anything meaningful in terms of analytics. In fact, there are a number of similar misconceptions about data mining and predictive analytics that I want to talk about in this blog post.

Myth one: it’s only worth mining huge datasets

It’s certainly true that many data mining projects do involve working with massive datasets and these tend to be the ones […]

Deployment of analytics – a great example of the Anna Karenina principle?

The Anna Karenina principle describes an endeavour in which a deficiency in any one of a number of factors dooms it to failure. A successful endeavour (subject to this principle) is therefore one in which every possible deficiency has been avoided. The principle takes its name from Leo Tolstoy’s novel Anna Karenina, which begins: “Happy families are all alike; every unhappy family is unhappy in its own way.” In this blog post I want to focus on one particular phase of the predictive analytics process – deployment. As the Anna Karenina principle suggests, there are an infinite number of ways […]

How the CRISP-DM method can help your data mining project succeed

I’ve worked in predictive analytics for many years, and I’ve seen that a key factor in a project’s success is using a structured approach based on a data mining methodology such as CRISP-DM (a quick declaration of interest here – I was one of the team who originally developed the CRISP-DM methodology). First published in 2001, CRISP-DM remains one of the most widely used data mining/predictive analytics methodologies. I believe its longevity in a rapidly changing area stems from a number of characteristics:

- It encourages data miners to focus on business goals, so as to […]

The first step in predictive analytics – understanding your data

I speak to a lot of people in organisations just starting out on their analytics journey, organisations that have begun to recognise that they could make better decisions if they could find the hidden patterns and nuggets of information in their data. Data talks, and you can tell very quickly whether it has something interesting to say. With all the current hype around big data, the irony is that, in my experience, the most common worry in the early stages of investigation is that the organisation doesn’t have anything to analyse. They are waiting for a new CRM system or […]