There’s been a fair amount of discussion recently as to whether or not the whole big data analytics agenda has entered the ‘Trough of Disillusionment’ yet. The reality is that for many of us working in the advanced analytics arena, discussion of big data disappeared into the ‘Valley of Please-God-No-More’ some time ago. By that I mean it is in no way a dead horse, but good grief has it been thoroughly flogged.
If you’re really unlucky, and have in fact spent the past couple of decades applying analytics to data sources of all shapes and sizes across all sorts of industries, then you’ll know just how wearying it is to sit through yet another ‘thought leader’ declaring how many new exabytes of data have been generated since breakfast. But the real hackles rise when the subject turns to big data and its exploitation by advanced analytics.
A standard complaint voiced regularly by the analytics community is that all this fuss over business applications of predictive modelling is nothing new. In fact it’s been going on for decades – look at the financial and telecommunications sectors. But really, that’s not a criticism at all – the whole point about big data is that there has been an exponential growth in electronically stored data since the development of cheap computing and that this growth represents both challenges and opportunities. No-one with any sense has a problem with that simple assertion.
However, when the subject of analytics comes up, suddenly we’re in the realm of half-truths where ‘the more data you have the better your models will be’. There’s so much wrong with that statement that it’s hard to know where to begin. First of all, what’s meant by ‘better models’? Anyone with experience of the sober realities of using data to make predictions knows that there are times when a model that is 78% accurate is deemed ‘better’ than one that is 85% accurate. Why? Well, for lots of reasons: maybe it’s simpler, maybe it makes more sense, perhaps it doesn’t rely on information that is only available at the last moment, maybe it’s likely to have a longer lifetime utility and maybe, whisper it, it’s based on a smaller sample and uses fewer fields.
This brings us to the second part of that uncontested statement – what’s meant by ‘more data’? Are we talking about higher granularity, where data about your mobile phone’s usage, status and location is uploaded to a warehouse every 5 minutes? Or about history, where a retailer’s daily sales of men’s clothing have been recorded since 1911? How about dimensionality, where multiple data sources that record transactions, web usage, call centre interactions, social media posts and survey responses are combined? I’m well aware of attempts to describe big data in terms of volume, variety, velocity (ad nauseam), but just because you’ve got ‘more data’ doesn’t mean you have to exploit all of it.
We’re all aware that you can have too little data to make accurate predictions, but can you ever have too much? Forecasting applications regularly exclude older data points because they don’t make for more accurate forecasts. Similarly, data analysts have to aggregate highly granular data just to get it into shape to build decent models, and many algorithms struggle to produce stable predictions when too many columns of data correlate with each other (an issue known as collinearity). Lastly, if you know anything about sampling theory, you might not be shocked to discover that a model built on 10 million records may not be much better than one built on 10 thousand (although you’d better not be in a hurry, because it will take a lot longer).
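That sampling point can be made with a back-of-the-envelope calculation. The sketch below (plain Python, using an illustrative event rate of 0.3 – my assumption, not a figure from any real dataset) compares the standard error of an estimated proportion at 10 thousand versus 10 million records: a thousand-fold increase in data buys only about a thirty-fold reduction in estimation error.

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of an estimated proportion p
    from a simple random sample of size n."""
    return math.sqrt(p * (1 - p) / n)

p = 0.3  # illustrative 'true' event rate (an assumption for the sketch)

se_10k = standard_error(p, 10_000)
se_10m = standard_error(p, 10_000_000)

# 1000x more records shrinks the error only by sqrt(1000), roughly 31.6x
print(f"SE at 10 thousand: {se_10k:.5f}")
print(f"SE at 10 million:  {se_10m:.5f}")
print(f"ratio:             {se_10k / se_10m:.1f}")
```

The square-root relationship is the whole story: each extra decimal order of magnitude of data costs ten times as much to process but narrows the error bars by barely a factor of three.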
Let’s not forget that there is a long and respectable history of innovative analytical applications that have delivered compelling insights and accurate predictions from lousy databases and disparate files full of incomplete, duplicated and generally messy data, where the volumes were relatively modest. In analytics, as in anything we do, we always need to balance the opportunity with the challenge.