**This is the fourth post in our ‘eat your greens’ series – a back to basics look at some of the core concepts of statistics and analytics that, in our experience, are frequently misunderstood or misapplied. In this post we’ll look in more depth at the concept of the normal distribution.**

One of the first abstract notions that students of statistics are introduced to is the concept of a distribution. A distribution simply describes how frequently each possible value of a particular variable occurs. For example, how many more people visit the cinema around 5 times a year than people who go 10 or 20 times a year? In an earlier post, I explored a key statistic used to measure variation in a distribution, the standard deviation. In particular, I looked at what makes a standard deviation ‘standard’.

Understanding this introduced us to the notion of a normal distribution. The normal distribution, also called a Gaussian distribution, looks like a symmetrical, bell-shaped curve where the mean, median and mode all occur exactly in the centre. We find that in nature, all sorts of phenomena are approximately ‘normally distributed’, from the height of marigolds to birth weight, and from reflex reaction times to the size of turtle eggs.
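As a quick illustration of those properties, here is a minimal sketch using Python's standard library. The mean of 170 and standard deviation of 10 are made-up values (think of heights in centimetres), not figures from any real dataset, and `share_within` is a helper name invented for this example.

```python
from statistics import NormalDist

# Hypothetical normal variable: mean 170, standard deviation 10 (assumed values).
heights = NormalDist(mu=170, sigma=10)

def share_within(dist, k):
    """Proportion of values lying within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(heights, k):.1%}")

# Symmetry: the 50th percentile (the median) sits at the mean.
print("median:", heights.inv_cdf(0.5), "mean:", heights.mean)
```

Whatever mean and standard deviation you choose, roughly 68%, 95% and 99.7% of values fall within one, two and three standard deviations of the centre. This is the link between the standard deviation and the bell curve.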

You will note that most of these examples refer to *the natural world not the social one*. Indeed, students soon discover that practically none of the variables in their sample datasets look anything like a normal distribution. That’s not terribly surprising, as there’s no reason to assume that variables like age, spending, number of YouTube hits or traffic volumes should even be symmetrically distributed, let alone bell-shaped. So why is the normal distribution still regarded as so important when we rarely see an example of it? The blunt answer is ‘error’.

The practice of statistics is devoted to *estimating* values. The purpose of drawing a sample is to estimate a value in a population, just as a researcher might take a sample of sea water to estimate pollution levels in the wider ocean. As it’s just a sample, we would expect a degree of error when using it to estimate the true value in the population (also referred to as the parameter). Moreover, if we draw another sample, we shouldn’t be surprised if our sample statistic (such as a mean value) differs from the value in our previous sample.

Now imagine we are dealing with a variable that records the amount of money people spend on footwear in a year. The distribution for this variable will *not* appear as normal. It will be skewed, with a long tail towards those individuals who spend a great deal more than most people do on shoes. Let’s assume that we could draw *many samples* of this variable repeatedly from the same population, and each time we do so, we calculate the mean spend amount. But instead of plotting the observations from a single sample, *we plot the different mean spend values* from the repeated samples. What would this distribution of sample means look like? In fact, we shouldn’t be surprised if it appears to be normally distributed, especially if the means are based on sample sizes that are sufficiently large (say, containing more than 30 cases) and we’ve taken enough repeated samples to plot a reasonably detailed chart like a histogram.

These kinds of (often theoretical) distributions are known as sampling distributions and there are various simulation tools on the web that help to illustrate them. Moreover, this phenomenon relates to a key aspect of statistics known as the Central Limit Theorem – which more or less states that given a sufficiently large sample size, a sampling distribution based on means will start to resemble a normal distribution regardless of the shape of that variable’s distribution in the population. In other words, your variable might not be normally distributed, but your *estimates* of that variable’s parameter are, at least approximately.
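If you would rather see this for yourself than take it on trust, the repeated-sampling idea can be sketched in a few lines of Python. Everything specific here is an assumption chosen for illustration: the lognormal model of yearly footwear spend and its parameters, the sample size of 50, the 2,000 repeated samples, and the helper names `yearly_shoe_spend`, `sample_mean` and `skewness`. None of it comes from real data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same numbers each run

def yearly_shoe_spend():
    """One person's yearly footwear spend, drawn from an assumed skewed model."""
    return random.lognormvariate(4.5, 0.6)

def sample_mean(n):
    """Draw a sample of n people and return their mean spend."""
    return statistics.fmean([yearly_shoe_spend() for _ in range(n)])

def skewness(values):
    """Simple skewness measure: about 0 for a symmetrical distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

# The raw variable: 10,000 individual spend values (long right tail).
raw = [yearly_shoe_spend() for _ in range(10_000)]

# The sampling distribution: the means of 2,000 repeated samples of 50 people.
means = [sample_mean(50) for _ in range(2_000)]

print(f"skewness of individual spend: {skewness(raw):.2f}")
print(f"skewness of sample means:     {skewness(means):.2f}")
print(f"spread of individual spend:   {statistics.stdev(raw):.2f}")
print(f"spread of sample means:       {statistics.stdev(means):.2f}")
```

If the Central Limit Theorem is doing its job, the skewness of the sample means should be much closer to zero than the skewness of the individual values, and the means should be far less spread out. Plotting `means` as a histogram in your tool of choice should show something close to the familiar bell shape.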

This is why students need to be able to recognise and understand the properties of the normal distribution (and how concepts like the standard deviation relate to it). By understanding how different estimated values are likely to be distributed, we can begin to calculate *margins of error* for our sample statistics, which in turn allows us to make inferences about the population.
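To make the idea of a margin of error concrete, here is a rough sketch of the usual calculation for a sample mean. It assumes a simple random sample that is large enough for the sampling distribution of the mean to be roughly normal. The function name `margin_of_error` and the simulated spend figures are invented for this example.

```python
import math
import random
import statistics

def margin_of_error(sample, z=1.96):
    """Approximate 95% margin of error for a sample mean.

    z=1.96 covers the middle 95% of a normal distribution. The standard
    error is the sample standard deviation divided by the square root of n.
    """
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return z * standard_error

# Simulated (not real) yearly footwear spend for 200 respondents.
random.seed(7)
spend = [random.lognormvariate(4.5, 0.6) for _ in range(200)]

mean_spend = statistics.fmean(spend)
moe = margin_of_error(spend)

print(f"sample mean: {mean_spend:.2f}")
print(f"95% margin of error: +/- {moe:.2f}")
print(f"plausible range for the population mean: "
      f"{mean_spend - moe:.2f} to {mean_spend + moe:.2f}")
```

Because the standard error divides by the square root of the sample size, quadrupling the number of respondents only halves the margin of error.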