Testing versus inferring

This is the second post in our ‘eat your greens’ series – a back-to-basics look at some of the core concepts of statistics and analytics that, in our experience, are frequently misunderstood or misapplied. In this post we’ll look in more depth at the concept of testing versus inferring.

One of the most daunting aspects of statistics that new analysts encounter is the sheer number of statistical tests available to researchers. The statistical program R, for example, is supported by an open source repository of over 10,000 packages available for download. Each package is designed to provide specific analytical capabilities for particular problems.

New students of statistics aren’t expected to familiarise themselves with such a vast library of analytical tools. Indeed, they are usually introduced to the same limited menu of classic routines that attempt to address generic questions of statistical inference. These routines take the form of comparing group proportions or mean values in order to infer whether any of the observed relationships are also likely to be present in the population rather than just the sample from which they are drawn.

The key term here is ‘infer’ rather than ‘test’. Indeed, it might be better if the word ‘test’ had never been used to begin with. Most people are aware that tests, such as those used to detect diseases like COVID-19, are prone to giving false positive results (in statistics, a Type 1 error) or false negative results (in statistics, a Type 2 error). But people are also used to regarding real-world tests, like smoke detectors or maths exams, as pretty much infallible.

In statistical testing, the outcomes are ‘detected’ based on probability values (commonly called ‘P-values’). The probability threshold at which the detection is made is known as the ‘alpha level’. In the 1920s, the founding father of modern statistics, Ronald Fisher, proposed that this threshold value (the alpha level) should be arbitrarily set at 0.05 (5%) or 0.01 (1%) depending on the circumstances of the test.

Fisher also introduced the notion that if the probability value fell below the established alpha level (e.g. 0.05), the results could be regarded as ‘statistically significant’ (another troublesome term that I discussed in a previous blog). The regrettable upshot of all this is that for decades a vast army of students, researchers and analysts around the world have been encouraged to focus almost solely on the probability values generated by an arsenal of available statistical ‘tests’.

Furthermore, if the results show a value below 0.05, they’ve been taught to regard this as a ‘statistically significant’ discovery, almost as if they’d pulled the handle on a fruit machine and hit the jackpot. In reality, statistical inference is about estimating how probable the observed evidence would be under a given assumption, rather than detecting the presence of something.

It’s not unusual for research papers to contain phrases such as “the test indicated a significant relationship between salary and job satisfaction (p < .05)”. Perhaps a more transparent way to report the outcome would be to say, “statistical inference estimates that the probability of observing an outcome at least as strong as this, if in fact there were no relationship between salary and job satisfaction, is 0.02 (or 2%)”.
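To make that second style of reporting concrete, here is a minimal sketch of one way such a probability could be estimated: a permutation approach that repeatedly shuffles one variable to mimic a world with no relationship. The salary and satisfaction figures are invented purely for illustration, and the function names are hypothetical rather than taken from any particular package.

```python
import random
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, n_shuffles=5000, seed=0):
    """Estimate how often a correlation at least as strong as the observed
    one turns up when any real link between xs and ys has been broken by
    shuffling ys."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate can never be exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical data: salary (in thousands) and satisfaction score (1 to 10).
salary = [28, 31, 35, 38, 42, 45, 51, 55, 60, 72]
satisfaction = [5, 4, 6, 5, 7, 6, 6, 8, 7, 9]

p = permutation_p_value(salary, satisfaction)
print(
    "Estimated probability of seeing a relationship at least this strong "
    f"if salary and satisfaction were unrelated: {p:.4f}"
)
```

Note that the printed sentence describes a probability under the assumption of no relationship. It does not announce that a relationship has been ‘detected’.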

Describing a result as a probability, rather than as a detection, makes it explicit that it’s possible to observe relationships or differences in data, even when those same relationships don’t exist in the wider world. Some might argue that the very fact that these procedures are called ‘tests’ encourages people to think of them rather like pregnancy tests where the results are clearly displayed and rarely wrong.
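A small simulation can illustrate how often a ‘significant’ difference appears when nothing is really there. The sketch below assumes two groups drawn from exactly the same population (so any difference is pure chance), uses a simple two-sided z-test with a known standard deviation of 1, and treats the number of simulated studies and the group size as arbitrary choices made for illustration.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(a, b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming the
    population standard deviation (sigma) is known."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha, n_studies=4000, n_per_group=30, seed=1):
    """Share of simulated studies that fall below alpha even though both
    groups come from the same population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_p_value(a, b) < alpha:
            hits += 1
    return hits / n_studies


for alpha in (0.05, 0.01):
    rate = false_positive_rate(alpha)
    print(f"alpha = {alpha}: {rate:.1%} of no-effect studies looked 'significant'")
```

Under these assumptions, the share of spurious ‘discoveries’ should sit close to the chosen alpha level, which is exactly what the threshold is designed to allow.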

In fact, the statistical community is painfully aware that the wholesale adoption of a 5% cut-off value (an alpha level of 0.05) to accept or reject a hypothesis has had serious ramifications in many fields of research. The problem even has a name: the replication crisis. This refers to a situation in which a sizeable proportion of the scientific community has been unable to reproduce the results of previous studies where an effect was shown to be ‘statistically significant’.

Many argue that one of the reasons for this is that an alpha level of 0.05 is simply not strict enough to produce reliable results for a statistical ‘test’. This is a serious problem for researchers everywhere and one that is not likely to go away soon. Perhaps the only way to properly address it will be to revisit the whole paradigm of ‘statistical testing’ and how we use it to find things out.

About The Author

Jarlath Quinn
