Nine tips for effective data mining

In my career I’ve seen many examples of successful and unsuccessful data mining projects. I’m often asked how clients can maximise the chances of their project being successful and, based on the many projects I’ve been involved with over the years, I think there are nine things that really help. When these factors are in place, a project has a much higher likelihood of success.

  1. Think carefully about which projects you take on. To maximise your chances of success, try to focus on the projects most clearly aligned with important business issues such as enhancing customer loyalty, identifying cross-selling opportunities or detecting fraud. It’s tempting just to delve around in your data for a while, but exploring problems simply because they sound cool, novel or intellectually challenging is a risky business. Once you’ve disappeared down that rabbit hole you could be there for a long time, whilst adding nothing to the profit margins of the business.
  2. Use as much data as you can from as many places as possible. When you’re modelling customer behaviour patterns it helps to have data both on customers who are very influential and on those who may currently be less influential but are susceptible to being influenced. Over- or under-representing either group in your population will skew your model and may cause you to overlook key variables found in the under-represented segment.
  3. Don’t just use internal customer data. If you limit yourself to your internal customer data you may be overlooking external data, such as social media activity, that could contain the variables you need to build into your churn, cross-sell or acquisition model. Even if you’re looking at the right population, building your training set from the wrong sources means you may inadvertently skew your model towards the most convenient variables rather than the most valid ones.
  4. Have a clear sampling strategy. You may have a powerful analytics platform that enables you to train your model on the entire population dataset. Typically, though, you’ll train it on a much smaller sample. Your sampling might be simple, aimed at extracting a representative subset of the total population, or complex, using more advanced techniques such as stratification. Either is fine as long as you’ve considered it and have a clear strategy as to which approach you’re going to use and why.
  5. Always use a holdout sample. A holdout sample enables you to check the predictive performance of your model over time. If you’re building models from old, inaccurate or inconsistent versions of your data, they will need especially scrupulous testing on new and unseen data to make sure they stand up in the real world. Testing ensures you haven’t built a model that makes a point-of-sale cross-sell recommendation from a piece of data which isn’t actually available at that point in the process (an example I have seen), and that you haven’t over-trained a model into perfectly learning the nuances of one set of data.
  6. Spend time on ‘throwaway’ modelling. Identifying the best predictors from a wide range of independent variables is the first part of the modelling process. Throwing all the information in, testing multiple models and then narrowing the selection down, all in the first day of your project, gives a leap forward in productivity. This is known as ‘throwaway modelling’ and it’s a valuable part of the process: being able to chuck everything in, discard the things that don’t work and keep the cream from the top means that neither the analyst’s bias nor the slowness of programming a new routine interferes with the accuracy of the results. If you skip this step there’s a risk that you’ll miss an important relationship in your data that you hadn’t thought of, or one that doesn’t fit with your own pet theory.
  7. Refresh your model regularly. If you think that the predictive model you’ve just built will always fit your real-world data perfectly, think again. Model quality can vanish in an instant as the world changes. You may need to score your models with fresh data every month, week, day or even every hour. Choosing the right scoring and retraining frequency is essential if your models are to retain their predictive validity over time.
  8. Make sure your insights are meaningful to other people. Communicating your insights across the organisation through pictures or patterns that non-statisticians can easily understand is vital. The elegant model you have created may be extremely complex under the hood, but knowing this will not help lesser mortals to understand and use the insight you have gained. Confuse people with statistical jargon and they won’t be able to make practical use of your findings. Make your findings clear, accessible and usable and you’ll be asked for more.
  9. Use your model in the real world. If you don’t deploy your model into the front line and use it to affect your business’s performance in some way, then you have spent a lot of time and expertise on an interesting research project that’s had no practical impact whatsoever. Make sure that you have clear deployment routes in mind right from the start. You need to ensure that Marketing can use your cross-sell model, that Contact Centre staff can see your churn risk scores, and that your acquisition modelling is being applied to new prospect campaigns. If you don’t ensure your models are deployed then you’ll never be able to demonstrate the power of your work.
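To make tips 4 and 5 concrete, here is a minimal sketch of a stratified train/holdout split using scikit-learn. The synthetic data, the churn flag and the 80/20 split are illustrative assumptions, not prescriptions from the article.

```python
# Sketch: a stratified train/holdout split (tips 4 and 5).
# Data is synthetic; in practice X and y come from your customer dataset.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))             # 1,000 customers, 5 features
y = (rng.random(1000) < 0.1).astype(int)   # a rare event, e.g. churn (~10%)

# stratify=y keeps the event rate the same in both partitions, so neither
# the training set nor the holdout over- or under-represents churners.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_hold))                  # 800 200
print(abs(y_train.mean() - y_hold.mean()) < 0.05) # True: similar event rates
```

The holdout set (`X_hold`, `y_hold`) is then only ever used to check the finished model, never to train it.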
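Tip 6's "chuck everything in, keep the cream" idea can be sketched as a quick model bake-off: fit several cheap candidate models on the same data, score them by cross-validation, and keep the best. The dataset and the three candidate models here are illustrative choices, not a recommended shortlist.

```python
# Sketch: 'throwaway' modelling as a quick cross-validated bake-off (tip 6).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real customer dataset.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validated accuracy for each candidate; keep the best.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because every candidate is scored the same way, the selection reflects the data rather than the analyst's pet theory; the survivors of this round are then worth refining properly.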
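One common way to decide when tip 7's refresh is due is to watch for drift between the scores (or inputs) the model saw at training time and those it sees now. This is a sketch of the Population Stability Index (PSI); the 0.1 "stable" and 0.2 "retrain" thresholds are widely used rules of thumb, not figures from the article.

```python
# Sketch: detecting drift with the Population Stability Index (tip 7).
import numpy as np

def psi(expected, actual, bins=10):
    """PSI between a baseline sample and a fresh sample of the same variable."""
    # Cut points from the baseline's quantiles (interior edges only).
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e_pct = np.bincount(np.digitize(expected, edges), minlength=bins) / len(expected)
    a_pct = np.bincount(np.digitize(actual, edges), minlength=bins) / len(actual)
    # Avoid log(0) for empty bins.
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)   # model scores at training time
stable = rng.normal(0.0, 1.0, 5000)     # fresh scores, behaviour unchanged
shifted = rng.normal(0.8, 1.0, 5000)    # fresh scores after behaviour drifts

print(psi(baseline, stable) < 0.1)    # True: no refresh needed yet
print(psi(baseline, shifted) > 0.2)   # True: time to retrain
```

Run against each fresh batch of scores, a check like this turns "refresh regularly" into a measurable trigger rather than a calendar guess.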