The first step in predictive analytics – understanding your data

I speak to a lot of people in organisations just starting out on their analytics journey, organisations that have started to recognise that they could make better decisions if they could find the hidden patterns and nuggets of information in their data. Data talks and you can tell very quickly if it has something interesting to say.

With all the current hype around big data the irony is that, in my experience, the most common worry in the early stages of investigation is that the organisation doesn't have anything to analyse. They are waiting for a new CRM system or data warehouse and hoping that the data will somehow magically all be in there. It's also common to find that data is liberally spread throughout the organisation in spreadsheets and other ad hoc sources. 

The truth is that there doesn’t need to be anything approaching a data warehouse in place before you start, and it's generally possible to make sense of data from many different sources, no matter how unpromising it may look (as long as it’s not literally a warehouse full of filing cabinets, which I promise you I've been faced with at an exploratory meeting before). Even when people insist there isn't any usable data in their organisation, it always turns out that there is. So, don't worry about not having enough data for predictive analytics. There will be data.

Recently I worked with a company keen to improve its customer repurchase rates. They were worried about whether they had sufficient data of the right quality accessible within their organisation. Together we went through a process of 'auditing' the data they had available, and in this blog post I will share some of the things we considered, as they may be useful to other organisations doing the same thing.

What different types of data do you have?

The first stage of any predictive analytics project is to get to grips with what data already exists within the organisation and what type of data it is. There are four main types of data which can be useful in predictive customer projects.

1. Descriptive data

Normally this is 'self-declared' geo-demographic information of the sort which might be collected on an initial application form or via a follow up customer survey. Common descriptive variables are things such as age, gender, postcode, income level, family group status and marital status – variables which describe some attribute of the customer.

2. Interaction data

This data tells us something about how an individual has interacted with an organisation. Typically interaction data might be gathered from visitor tracking systems on websites and then information calculated from this. For example interaction data might tell you when someone last visited your website, how frequently they visit, which pages they looked at and so on. In the bricks and mortar world a hotel might track how many times a guest has stayed with them, how long they stay for, who stays with them and whether they have dinner or not when they stay.

3. Sentiment data

Sentiment data is also generally 'self-declared', but rather than always being structured in the rows and columns of a spreadsheet, it may instead be free text: verbatim responses to satisfaction survey questions, or sentiments expressed in tweets, Facebook comments or complaints letters, for example. Free text data can be extremely useful and can add a lot of colour to your analysis. Using text analysis techniques, it can be analysed in much the same way as row / column data.

4. Behavioural data

This sometimes overlaps with interaction data, but it is information taken from the initial sign-up process, point of sale systems, tills and entry points at venues: for example, products purchased, payment method used, renewal behaviour, account status and so on.


You don't have to have all of these types of data or indeed all these exact data fields. The more data you have, the more there is to test for importance and influence, but many customers have achieved strong initial results from what can appear to be quite limited ranges of data.
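As a rough illustration of how interaction measures such as "last visit" and "visit frequency" might be calculated from raw tracking records, here is a minimal Python sketch. The visit log, the customer IDs and the function name summarise_visits are all invented for this example; a real tracking system will export its data in its own format.

```python
from collections import defaultdict
from datetime import date

# Hypothetical visit log: (customer_id, visit_date) pairs, invented for illustration.
visits = [
    ("C001", date(2024, 1, 5)),
    ("C002", date(2024, 1, 9)),
    ("C001", date(2024, 2, 17)),
    ("C001", date(2024, 3, 2)),
]


def summarise_visits(visits, as_of):
    """Return visit count, last visit and days since last visit per customer."""
    by_customer = defaultdict(list)
    for customer_id, visit_date in visits:
        by_customer[customer_id].append(visit_date)

    summary = {}
    for customer_id, dates in by_customer.items():
        last_visit = max(dates)
        summary[customer_id] = {
            "visit_count": len(dates),
            "last_visit": last_visit,
            "days_since_last_visit": (as_of - last_visit).days,
        }
    return summary


# The as_of date is fixed so the result does not depend on when the code is run.
summary = summarise_visits(visits, as_of=date(2024, 3, 31))
```

Each customer ends up as one row of calculated fields, which is the shape most modelling tools expect.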

What if I have lots of missing or incomplete data?

It’s also very common to worry about missing or incomplete data. You may have years and years of customer data in total but it may be incomplete or some of it may be of poor quality.  My experience is that some data is better than no data and most data, however unpromising it may look at first glance, can usually be cleaned and worked with. New fields can be calculated, for example calculating age from date of birth, and there are also more sophisticated methods of estimating or inferring completely missing values.
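To make this concrete, here is a small Python sketch of calculating age from date of birth and then filling the gaps. Filling with the median is used here only as one of the simplest possible approaches, not as a recommendation; the more sophisticated methods mentioned above are not shown. The sample dates and the function names age_on and fill_missing_with_median are assumptions made for the example.

```python
from datetime import date
from statistics import median


def age_on(date_of_birth, as_of):
    """Age in whole years on the as_of date."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


def fill_missing_with_median(values):
    """Replace None with the median of the known values (a deliberately simple choice)."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate from
    fill = median(known)
    return [fill if v is None else v for v in values]


# Invented sample: the second customer has no date of birth on record.
dates_of_birth = [date(1980, 6, 15), None, date(1995, 12, 1), date(1972, 3, 30)]
as_of = date(2024, 6, 14)

ages = [age_on(d, as_of) if d is not None else None for d in dates_of_birth]
ages_filled = fill_missing_with_median(ages)
```

Keeping the original column alongside the filled one is usually worthwhile, so you can always tell which values were recorded and which were estimated.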

Sampling is also an important consideration. The most robust and reliable models are built on samples of the data, with other samples held back to test and refine them. This means that you do not need to have a complete transaction history over many years in order to start building predictive models.
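The idea of building on one sample and holding back another can be sketched in a few lines of Python. The 30% holdout fraction, the fixed seed and the function name split_holdout are assumptions for illustration only; the right proportions depend on how much data you have.

```python
import random


def split_holdout(records, holdout_fraction=0.3, seed=42):
    """Shuffle the records and split them into (build_sample, holdout_sample)."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Invented customer IDs C001 to C100.
customer_ids = [f"C{n:03d}" for n in range(1, 101)]
build_sample, holdout_sample = split_holdout(customer_ids)
```

The model is built using build_sample only, and holdout_sample is then used to check how well it performs on customers it has not seen.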

I mentioned earlier that it's common for organisations to hold lots of different types of data in numerous different places, often in completely different file formats. These days that doesn't need to be a major concern – most analytics tools are good at connecting multiple sources and multiple formats for the purpose of analytics.
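As a simple illustration of bringing two differently formatted sources together, the Python sketch below joins a CSV extract and a JSON extract on a shared customer identifier. The field names, the sample records and the function name combine_sources are hypothetical, and it assumes both sources already share a customer_id, which in practice often takes some matching work first.

```python
import csv
import io
import json

# Hypothetical extracts from two systems, inlined so the example is self-contained.
crm_csv = """customer_id,postcode
C001,AB1 2CD
C002,EF3 4GH
"""
till_json = (
    '[{"customer_id": "C001", "product": "Annual pass"},'
    ' {"customer_id": "C003", "product": "Day ticket"}]'
)


def combine_sources(csv_text, json_text):
    """Merge records from both sources into one dict per customer_id."""
    combined = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        combined.setdefault(row["customer_id"], {}).update(row)
    for record in json.loads(json_text):
        combined.setdefault(record["customer_id"], {}).update(record)
    return combined


customers = combine_sources(crm_csv, till_json)
```

Customers who appear in only one source are kept, with whatever fields that source provides. This sketch keeps a single record per customer, so a customer with several purchases would need a list of transactions instead.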

Making your data talk to you is really what it's all about. If yours doesn’t talk and add value in a way that helps you make decisions, then maybe you don’t need to worry about where it is, but rather about what data is being collected in the first place.
