The real world, whether the physical world (for example, machines) or the natural world (for example, human and animal behaviour), is very complex: many factors, some of them unknown, determine how a system behaves and how it responds to interventions. Even if every contributory factor to a phenomenon is known, it is unrealistic to expect that the unique contribution of each factor can be isolated and quantified. Thus, mathematical models are necessarily simplified representations of reality, but to be useful they must give realistic results and reveal meaningful insights.
In his 1976 paper ‘Science and Statistics’ in the Journal of the American Statistical Association, George Box wrote: ‘Since all models are wrong, the scientist cannot obtain a ‘correct’ one by excessive elaboration. On the contrary, following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist, so over-elaboration and over-parameterization is often the mark of mediocrity.’ In some ways, this is an extension of a famous saying attributed to Einstein, ‘Everything should be made as simple as possible, but not simpler.’
The terms data mining, statistical modelling and predictive analytics are often used interchangeably whereas in fact they have different meanings, particularly data mining and statistical modelling. The aim of this blog post is to clarify these differences and to present a glossary of the following terms:
- Statistics and statistic
- Mathematical models
- Deterministic models
- Statistical models
- Data mining
Statistics and statistic
The word statistics derives from stato, the Italian word for state. The original aim of statistics was the collection of information for and about the state.
The birth of statistics as we know it today was in the mid-17th century, when John Graunt, a shopkeeper from London, began reviewing a weekly church publication, issued by the local parish clerk, that listed the number of births, christenings and deaths in each parish. These lists were called the Bills of Mortality, and Graunt published his analysis of them, in a form that we now call descriptive statistics, in Natural and Political Observations Made upon the Bills of Mortality. Graunt was later elected to the Royal Society.
Statistics can be used in the singular and the plural.
- The singular form is used when referring to the academic subject. For example, statistics is offered as a part of many university mathematics degrees. One definition of the subject statistics is ‘the science of collecting, organising and interpreting data’. The data analysed are usually a sample obtained from surveys, experiments or as a periodic snapshot, and the aim is to infer information about the population from which the sample was drawn. When the entire population rather than a sample is analysed, the exercise is called a census.
- The plural form refers to at least two quantities. For example, the mean and standard deviation are two summary statistics for continuous data.
The latter form can be used in the singular, statistic. For example, the mean is the most frequently used statistic for the central tendency of continuous data.
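As a small illustration of summary statistics, the sketch below computes the mean and the sample standard deviation with Python's standard statistics module. The data values are invented purely for illustration.

```python
import statistics

# Hypothetical sample of continuous data (values invented for illustration).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Two summary statistics for continuous data.
sample_mean = statistics.mean(sample)  # a statistic for central tendency
sample_sd = statistics.stdev(sample)   # a statistic for spread (n - 1 divisor)

print(f"mean = {sample_mean:.2f}, standard deviation = {sample_sd:.2f}")
```

Each of the two numbers is a statistic; together they are summary statistics for the sample.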
Mathematical models
A mathematical model is a description of a system using mathematical concepts, for example algebra, graphs, equations and functions, and language, for example arithmetic signs. Mathematical modelling is the process of developing mathematical models.
Deterministic models and statistical models
Mathematical models can be classified as either deterministic models or statistical models.
- A deterministic model is a mathematical model in which the output is determined only by the specified values of the input data and the initial conditions. This means that a given set of input data will always generate the same output.
- A statistical model is a mathematical model in which some or all of the input data have some randomness, for example as expressed by a probability distribution, so that for a given set of input data the output is not reproducible but is described by a probability distribution. The output ensemble is obtained by running the model a large number of times with a new input value sampled from the probability distribution each time the model is run. Statistical models can be run by using Monte Carlo simulation.
So, another definition of a statistical model is a mathematical description of a system that accounts for uncertainty in the system.
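The contrast between the two kinds of model can be sketched in a few lines of Python. The compound-growth formula, the parameter values and the choice of a normal distribution for the uncertain growth rate are all assumptions made for illustration; they do not come from any particular application.

```python
import random

def deterministic_model(principal, rate, years):
    """Compound growth: the same inputs always give the same output."""
    return principal * (1.0 + rate) ** years

def monte_carlo(principal, mean_rate, sd_rate, years, n_runs, seed=0):
    """Run the model many times, sampling the uncertain input on each run."""
    rng = random.Random(seed)  # seeded only so that this sketch is repeatable
    outputs = []
    for _ in range(n_runs):
        rate = rng.gauss(mean_rate, sd_rate)  # new input value for this run
        outputs.append(deterministic_model(principal, rate, years))
    return outputs

# Deterministic: a single number.
fixed = deterministic_model(100.0, 0.05, 10)

# Statistical: an ensemble of outputs, summarised here by a 95% interval.
ensemble = sorted(monte_carlo(100.0, 0.05, 0.01, 10, n_runs=10_000))
low = ensemble[int(0.025 * len(ensemble))]
high = ensemble[int(0.975 * len(ensemble))]

print(f"deterministic output: {fixed:.1f}")
print(f"Monte Carlo 95% interval: {low:.1f} to {high:.1f}")
```

The deterministic model returns one value, whereas the Monte Carlo run returns a distribution of values from which intervals and other summaries can be read.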
Statistical modelling is the process of forming a hypothesis about a set of data, expressing it as a statistical model, fitting the model and then testing it against the data to see whether the data support the hypothesis.
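A minimal sketch of this hypothesise, fit and test cycle is given below. The hypothesis is that y changes linearly with x; the data are invented, and the permutation test is only one of many possible tests, chosen here because it needs nothing beyond the Python standard library.

```python
import random

def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def permutation_p_value(xs, ys, n_perm=5000, seed=1):
    """Share of shuffled data sets whose |slope| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(slope(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data (invented for illustration).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

fitted_slope = slope(xs, ys)          # fit the hypothesised model
p_value = permutation_p_value(xs, ys) # test it against 'no relationship'

print(f"fitted slope = {fitted_slope:.2f}, permutation p-value = {p_value:.4f}")
```

A small p-value means that a slope this steep would rarely arise if x and y were unrelated, so the data support the hypothesised relationship. It does not prove that the model is correct.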
Data mining
Data mining is the process of analysing data to find new patterns and relationships in the data. In some ways, it is an ‘exploratory walk through the data’ without a particular objective in mind but with an open mind as to the patterns and trends in the data that will be revealed.
One of the main differences between data mining and statistical modelling is that data mining does not require a hypothesis but statistical modelling does require a hypothesis for the model. Thus, in statistical modelling a model is specified in advance but in data mining no relationships are specified.
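To make the contrast concrete, the sketch below scans every pair of fields in a small table and ranks the pairs by the strength of their correlation, without specifying any relationship in advance. The field names and values are hypothetical, and Pearson correlation is used only as a simple stand-in for the much wider range of methods used in data mining.

```python
import itertools
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical customer table (field names and values invented).
table = {
    "visits":  [3, 5, 2, 8, 7, 4, 6, 9],
    "spend":   [30, 52, 18, 85, 66, 41, 63, 88],
    "age":     [45, 23, 36, 52, 29, 61, 33, 40],
    "returns": [1, 0, 2, 0, 1, 1, 0, 0],
}

def rank_pairs(fields):
    """Score every pair of fields; no relationship is specified beforehand."""
    scored = []
    for a, b in itertools.combinations(fields, 2):
        scored.append((abs(pearson(fields[a], fields[b])), a, b))
    return sorted(scored, reverse=True)  # strongest pattern first

ranked = rank_pairs(table)
for strength, a, b in ranked:
    print(f"{a:8s} vs {b:8s} |r| = {strength:.2f}")
```

The pairs at the top of the list are patterns worth investigating further. They are not tested hypotheses, and a strong correlation on its own does not establish a cause.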
Some misconceptions about data mining
Data mining models can be quite complex, so to get the maximum benefit from them users must be familiar with the models and, equally importantly, know their limitations. The black box nature of some data mining software makes data mining easy to misuse, and this can lead to bad and very costly business decisions being taken. There are a few misconceptions about data mining that should be clarified.
Misconception 1: Data mining requires little or no human intervention.
Response 1: Data mining is a process, not an event, and data mining projects should be carried out using a structured and robust methodology, such as CRISP-DM.
With respect to the data preparation phase of CRISP-DM, data mining requires human intervention if only for the following simple and obvious reasons:
- Each set of data is unique and so has its own characteristics that determine how and to what extent it needs to be prepared.
- Data preparation covers a very wide range of methods, and the particular methods used and the way they are applied depend on many things including the data (each field), the modelling methods to be used and the aims of the project.
Data mining and analytics software should not be used blindly, i.e. without understanding the modelling methods. If it is used by people who are not familiar with the methods, they are likely to apply the wrong method to the wrong data and so get the wrong answer to the wrong question. The implications for business of using results generated in this way are self-evident.
Misconception 2: Data mining software packages are intuitive and easy to use.
Response 2: Data mining is concerned with analysis and modelling, not with IT. Data mining software is not ‘plug and play’ software: it requires people with knowledge and experience of the models to use it so that the maximum commercial benefit is gained from it.
Misconception 3: Data mining can identify the causes of the problem.
Response 3: Data mining is much more likely to identify the causes of the problem if it is used with CRISP-DM or a similar methodology. To identify those causes, a thorough understanding of the background and context of the problem, and of the data, is essential.