# Frequently asked questions

Here are the answers to some of the questions that we’re most frequently asked. If the answer to your question isn’t here then please use the form on the right to send us your question and we’ll get back to you with an answer as quickly as we can.

If you are an existing customer with a support contract then please contact customer support.

## SPSS Statistics

#### What’s new in SPSS v27?

In June of this year, IBM released the latest version of SPSS Statistics. Version 27 introduces several additional analysis procedures as well as new system enhancements. In this video we take a tour of some of the most valuable improvements that have been made. You can also read more about the changes in this blog post.

#### How do I upgrade to SPSS v27?

In order to complete an upgrade to V27 you will need to have an up to date support and maintenance contract with Smart Vision.

If you do not have an up to date support contract then it is possible to buy reinstatement licences to get the upgrade organised. If you would like to investigate this route please get in touch via [email protected] with details of the licence you hold and we can review this for you.

It’s a simple process to upgrade from one version of SPSS to the next. The first thing you need to do is ‘unlock’ your existing installation in order to enable the upgrade, then download the version 27 files, install v27 and update your authorisation code. Detailed instructions on how to do this are given below.

If you hold Concurrent licences however, please contact our support team ([email protected]) to plan your upgrade. There is also an upgrade to the central licence manager and the team can assist you with this as it is a more complex process.

## Step one – ‘unlock’ your existing installation

- Begin by starting the SPSS License Authorisation Wizard
- Go to Start -> All Programs / Applications -> IBM SPSS Statistics -> SPSS License Authorization Wizard (LAW)
- Allow the program to make changes to your machine by selecting ‘Yes’ when asked on screen
- Follow the on screen instructions until presented with the following dialog box…

## Step two – email Smart Vision to kick off the ‘unlock’ process

- Email Smart Vision support ([email protected]) or call (0845 680 0408) and let us have the machine lock code and your authorisation code if you still have it from your existing installation
- Once we have the machine lock code (and where possible the authorisation code of the current install) we will then work with IBM to ‘unlock’ the code for you
- This process normally takes 24 hours to complete, at the end of which we will supply you with a new authorisation code for v27
- Note, if you don’t have the authorisation code from your current installation then please allow an extra day for us to get this released

## Step three – install v27 on your machine

- Download the v27 files here – to access the page you will need the password that we emailed to you in the upgrade notification email (contact us if you have not received this email and would like to upgrade)
- Install v27 on your machine as you normally would
- v27 will install over the top of your previous version – you do not need to uninstall your previous version.
- Use the updated authorisation code we’ve given you to activate v27
- If there is an issue with authorisation due to internal internet security, then you will need to provide us with the new lock code from version 27 so that we can generate a long licence code to use instead

#### How do I download my SPSS software once I’ve purchased it?

After you’ve purchased SPSS software from us you’ll be able to download it from our website. After your purchase you will receive the following email containing your username and a link to set up your password.

Clicking the link will open your web browser to the following page allowing you to set a personal password. Choose to Reset Password then click Log in.

Once you have logged in you will see the dashboard below.

From here you can manage your account and download the software you require. Simply click the File Manager option on the left-hand side.

You can download whichever version of the software you wish (to go with the authorisation code you have been supplied with).

You can return to this page at any time.

## Contact us

If you have any problems, please don’t hesitate to get in touch.

If you’re a Smart Vision customer with a support contract then you can contact our technical support team either by email or by calling 0845 680 0408.

Technical support is available Monday – Friday (excluding bank holidays), 9am-5.30pm

## Download these instructions

If you’d like a printable PDF of these instructions you can download a copy here – just click the link and the file will automatically download.

#### How do I know which SPSS modules I have installed?

Sometimes it can be useful to know which SPSS modules you have installed and are licensed for. There’s a simple syntax command that can give you this information quickly and easily. Our video guide shows you how, or check out the written instructions below.

### Checking which modules are installed

First, open a new syntax window by selecting New / Syntax from the File menu, then type the command ‘show all’ (see the screenshot below).

This procedure will create a range of output, part of which is a table which lists all the installed modules, as shown in the example below.

#### How do I update my SPSS activation code?

When your licence comes to an end you will need to buy a new licence. You do not need to reinstall the software; you simply need to update your license activation code. We will supply you with a new code at the point at which you buy the new software licence. To update it, simply follow the instructions below.

**What happens at the end of my free trial or when my licence comes to an end?**

**If you have been running a free trial**

If you are going to be running a paid installation of SPSS Statistics or Modeler on the same machine that you used for the free trial then you do not need to download the software again. You simply need to update your license activation code. We will supply you with a new code at the point at which you buy the software. To update it, simply follow the instructions below.

**If your paid licence period has come to an end**

You do not need to download the software again. You simply need to update your license activation code. We will supply you with a new code at the point at which you renew your licence. To update it, simply follow the instructions below.

### Step one

Locate the SPSS License Authorisation Wizard. This is in the Programme Menu in Windows. If you’re using a Mac then navigate to this via your Applications folder.

### Step two

Click on the License Authorisation Wizard. You’ll then be asked for permission to run the license manager. Click yes and the current license status will then show.

### Step three

Select ‘License my product now’

### Step four

Enter License Codes. You can enter additional codes following the first code to activate multiple products.

#### How does the SPSS License Manager work?

If you opt for a concurrent user license then you can install SPSS on as many machines as you like, as long as it is only ever used by one person at a time. To control this you would install SPSS License Manager. The license manager is a separate application which will need to be installed on a central machine/server to host the license.

Once the license manager is installed and license is added, this machine will have a background service running (at all times as long as the machine is on) that will host the licenses. Each end user then needs to point their software to this machine (either the IP address or server/machine name).

From then on, every time the user opens SPSS Statistics the software will automatically retrieve the license it needs to run from the license manager (providing no one else is using it at the time). The end user can also use the Commuter application to check out a license in case they want to use it outside the network (for instance a laptop user who wants to continue working out of the office).

To ensure the license is always returned there is a time limit which will automatically return the license to the license manager.
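One way to picture the time limit is as a lease with an expiry. The sketch below is purely illustrative: the `CommuterLease` class, its field names and the seven-day limit are all hypothetical and are not taken from the IBM licence manager.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CommuterLease:
    """Hypothetical model of a checked-out licence with an expiry time."""
    user: str
    checked_out_at: datetime
    max_duration: timedelta  # assumed time limit, chosen for illustration

    def is_expired(self, now: datetime) -> bool:
        # Once the limit has passed, the licence counts as returned.
        return now >= self.checked_out_at + self.max_duration


lease = CommuterLease("laptop-user", datetime(2021, 3, 1, 9, 0), timedelta(days=7))
print(lease.is_expired(datetime(2021, 3, 5, 9, 0)))  # inside the limit
print(lease.is_expired(datetime(2021, 3, 9, 9, 0)))  # limit has passed
```

The point of the design is that the licence comes back even if the user never returns it manually.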

#### How do I change the language in SPSS?

You can easily change language from English to another language of your choice directly within SPSS Statistics. This can be done in the Edit menu by selecting ‘options’ and then ‘language’. You can change the language of three different aspects of SPSS – your output, the user interface and the character encoding for data and syntax, depending on your requirements.

#### How do I transfer my SPSS license to another machine?

In order for Smart Vision Europe Ltd to be able to assist you with the transfer of an SPSS license from one computer to another you will need to be able to demonstrate one of the following:

- Have an up to date support and maintenance contract with Smart Vision Europe Ltd
- Have had a previously supported license that was purchased via Smart Vision Ltd and for which Smart Vision Europe was the designated support provider

If your license was purchased direct from IBM or elsewhere you will need to contact your designated support provider as Smart Vision Europe will not have access to your license code and we will not have the required authorisation to unlock the activation code for you.

If you have installed SPSS on one machine and then need to move it to another machine then here’s the process for doing so.

- Begin by starting the SPSS License Authorisation Wizard
- Go to Start -> All Programs / Applications -> IBM SPSS Statistics -> SPSS License Authorization Wizard (LAW)
- Allow the program to make changes to your machine by selecting ‘Yes’ when asked on screen
- Follow the on screen instructions until presented with the following dialog box…

- Call or email Smart Vision support and let us have the machine lock code
- Once we have the activation code and the machine lock code of the current install we will then work with IBM to ‘unlock’ the code
- This process normally takes 24 hours to complete
- Once unlocked you will be able to use the activation code to install on a new PC

#### What’s the difference between the various license types available for SPSS?

IBM SPSS Advanced Analytical Products, including IBM SPSS Statistics and IBM SPSS Modeler, can be implemented and licensed in different ways. How your organisation licenses and implements these tools will depend on its requirements. This page explains the options.

### Client licence types

The client (i.e. desktop) versions of both the Statistics and Modeler product lines are licensed in one of two ways. The difference between the two is to do with who can use the software.

**1. Authorised user licences**

An authorised user licence is tied to a specific named individual and is for their exclusive use. It can be installed on a single machine. This form of licence is also known as a **standalone licence** or a **named user licence**. You can buy multiple single user licences for deployment throughout your organisation.

**2. Concurrent user licences**

Concurrent user licences allow anyone in the organisation to use the product but stipulate the maximum number of people who can use it at any one time. For example, a two-user licence means two people can both be using the software but if a third person attempts to access, they will be refused until one of the first two people logs out. A licence manager application is installed centrally to control this access. This type of licence is also known as a **network licence**.
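The two-user example can be sketched as a small seat pool. This is an illustrative model only: the `ConcurrentLicencePool` class and its method names are hypothetical, and the real licence manager is a separate application whose internals are not described here.

```python
class ConcurrentLicencePool:
    """Hypothetical model of a concurrent (network) licence with a fixed seat count."""

    def __init__(self, seats: int) -> None:
        self.seats = seats
        self.active_users = set()  # names of users currently holding a seat

    def check_out(self, user: str) -> bool:
        # Grant a seat only if one is free, or if the user already holds one.
        if user in self.active_users:
            return True
        if len(self.active_users) >= self.seats:
            return False
        self.active_users.add(user)
        return True

    def check_in(self, user: str) -> None:
        # Closing the software frees the seat for someone else.
        self.active_users.discard(user)


pool = ConcurrentLicencePool(seats=2)
results = [pool.check_out(name) for name in ("ann", "bob", "cat")]
print(results)                 # the third user is refused
pool.check_in("ann")
print(pool.check_out("cat"))   # a seat is free again
```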

Which of these options you choose will be influenced by the characteristics of your team.

- Whether you have full time or part time users
- The size of your team
- Whether you have concurrently running projects or one off busy times

### Client or client/server installation?

#### Local client installation

SPSS products are ‘fat client’ software products. This means that you can install a version locally on any individual user’s computer (often referred to as the client machine) and that user can then access data on that computer (for example as a flat file or a range of flat files) as well as data that resides in any other location to which the user can connect their computer (for example a network drive or using database connectivity protocols such as ODBC).

For many applications the local (client) install option will cater for all your analytical requirements. However there are also situations where an organisation may require different software set up. This is often where an organisation may consider using client/server architecture.

#### Client/server installation

In this configuration the local client install of IBM SPSS (Statistics, Modeler or both) will connect to a copy of the equivalent software that is designed to run on a server. The server is just another computer on your organisation’s network (likely to be of a higher specification with more processing and storage capacity than a typical client machine) to which multiple client machines can potentially connect.

### Which option is right for me?

Whether you go for a local client installation or a client/server installation depends on the way in which you intend to use SPSS and what type of analysis you’ll be performing. Whether you need a server really depends on the data volumes you wish to analyse. You are unlikely to need one if you are analysing small samples of data, but a server licence will improve your productivity if you are analysing large data files or running analysis directly against a data warehouse or data mart. Bear in mind that it is often the width of a file (how many variables you have) rather than the length (how many records), together with the types of procedure you are likely to run, that influences processing times. In general you will probably find the client/server option more suited to your needs in the following kinds of situations:

- If you need to analyse large volumes of your organisation’s data (large numbers of variables and many rows) or are performing particularly complex analysis requiring a high degree of processing power to be able to run.
- If you’re running data management and manipulation tasks using structured query language (SQL) – Modeler Server can request that SQL is run within the database, offering another approach to making the whole process more scalable.
- If you need to make use of the processing capacity of the ‘database tier’ of your organisation’s IT infrastructure. Often a company’s databases or data warehouse will run on large and powerful dedicated servers. Increasingly database software includes modelling algorithms. Modeler Server can leverage this database capability through a technique called ‘in database mining’. In this context the model building can be run inside the database where processing can be completed even more quickly.
- If you need to run scheduled batch and automated processes, such as having a whole analytical process run overnight when lots of processing capacity is available.

### What is the cost of a client/server installation?

The cost of a client/server installation of SPSS Statistics or Modeler is influenced by two elements:

- The cost of the client licenses – this depends on the number of users who will be using the client software.
- The cost of the server license – this cost is calculated on the processing capacity of the server (or virtual server) that you intend SPSS to run on. The basic principle is that the more powerful the server, the more you would need to pay for the server license.

### Licence length

In addition to considering whether to go for the client or client/server option, you also need to think about what kind of licence you’ll need.

#### Perpetual licence

Under the terms of a perpetual licence you pay a single one-off fee. This covers the cost of the software and your first year’s maintenance. At the end of the year you can opt to renew your maintenance contract if you wish, or you can let it lapse. You then have access to the software in perpetuity, but you will not have access to support, maintenance or upgrades and patches.

#### Fixed 6/12 Month Term

It is also possible to purchase a license entitlement over a fixed 6* or 12 month term. This would provide your organisation with 6 or 12 months’ access to the licenses you have purchased. After the license period has elapsed the software will no longer be accessible, unless the license term has been renewed. No data or other related assets would be lost or destroyed in the case of non-renewal of a fixed term license; but you wouldn’t be able to access the software anymore.

#### SaaS – Software as a Service

IBM SPSS Modeler Gold is available via hosted delivery. This means that you can access it via the internet through a web browser. This has the advantage of not requiring any local install. SaaS delivery is similar to a rental license type. The minimum term for a SaaS license is three consecutive months.

*Available for SPSS Statistics Base, Regression, Advanced Statistics, Custom Tables, Decision Trees and Categories

#### How can I activate my SPSS license on a machine without internet access?

From time to time we come across IBM SPSS Statistics or Modeler customers who need to install the product on a computer without internet access, or who have a firewall that is preventing the License Activation Wizard from working. It is still possible to install your software in this situation – simply follow the instructions below.

First, the license key administrator should go to the page where the authorization code is generated (see the example below) and click on the **authorization code.**

This will take you to a page that asks for the lock code, as per the screenshot below.

The lock code is a unique identification number linked to a particular machine. You can get your lock code by opening the licence activation wizard (installed on your machine at the same time as you installed the software). Click ‘next’ and you should see the lock code, as per the screen shot below.

Note that if you are running a concurrent licence then you will need to run the file ‘echoid’ on the command line from the concurrent licence manager installation in order to get the lock code.

Once the administrator has the lock code for your machine they should enter this into the license key centre screen and click activate. This then generates the full licence code. Click on the licence code to generate the full text of the code, which the administrator then needs to give to you. You should then enter this code into the licence authorization wizard screen that asks you to enter the licence code. The software should then be activated.

#### How can I combine variables in SPSS Statistics?

SPSS users often want to know how they can combine variables together. In this video Jarlath Quinn demonstrates how to use the compute procedure to calculate the mean of a number of variables to create one combined variable, and also how to use the count values procedure to count how many times a particular value occurs across a series of variables in order to create an overall count.
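The logic behind both operations can be sketched in plain Python. This is not SPSS syntax, and the variable names (`q1` to `q3`) and responses are invented for illustration.

```python
from statistics import mean

# Hypothetical survey responses: three rating items per respondent (1-5 scale).
respondents = [
    {"q1": 4, "q2": 5, "q3": 3},
    {"q1": 2, "q2": 2, "q3": 5},
    {"q1": 5, "q2": 5, "q3": 5},
]
items = ["q1", "q2", "q3"]

for row in respondents:
    values = [row[name] for name in items]
    # Combined variable 1: the mean of the three items.
    row["mean_score"] = mean(values)
    # Combined variable 2: how many of the items hold the value 5.
    row["count_of_5"] = values.count(5)

for row in respondents:
    print(row["mean_score"], row["count_of_5"])
```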

#### How can I reverse a scale in SPSS Statistics?

In this video Jarlath Quinn demonstrates how to reverse the values of a rating scale (such as an agreement scale or a satisfaction scale) in SPSS Statistics, so that the highest value becomes the lowest value and vice versa. Jarlath shows two methods of doing this – one using the compute procedure and the other using the recode procedure.
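As a rough illustration of the two approaches, here is a minimal Python sketch (not SPSS syntax) assuming a 1 to 5 scale. The function name `reverse_score` and the `recode_map` table are hypothetical.

```python
def reverse_score(value: int, scale_min: int = 1, scale_max: int = 5) -> int:
    """Reverse a rating so the highest value becomes the lowest and vice versa."""
    if not scale_min <= value <= scale_max:
        raise ValueError(f"{value} is outside the {scale_min}-{scale_max} scale")
    # Arithmetic method: new value = (minimum + maximum) - old value.
    return scale_min + scale_max - value


# Lookup method: spell out each old value and its replacement.
recode_map = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}

ratings = [1, 2, 3, 4, 5]
print([reverse_score(r) for r in ratings])
print([recode_map[r] for r in ratings])
```

The arithmetic method adapts to any scale length, whereas the lookup method makes every mapping explicit.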

#### How can I calculate with dates in SPSS Statistics?

In this video Jarlath Quinn demonstrates how to work with date and time variables in SPSS using the SPSS date and time wizard. This enables you to:-

- Calculate time units between two dates
- Add / subtract time units to or from dates
- Extract part of a date or a time, such as days of the week or months of the year
- Create date or time variables from variables holding part of dates or times
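For readers who think in code, the four kinds of calculation listed above look like this in plain Python. The wizard itself needs no code, and the dates used here are arbitrary examples.

```python
from datetime import date, timedelta

start = date(2020, 1, 15)   # arbitrary example dates
end = date(2020, 3, 1)

days_between = (end - start).days            # time units between two dates
follow_up = start + timedelta(weeks=2)       # add time units to a date
weekday_number = start.isoweekday()          # extract part of a date (1 = Monday)
rebuilt = date(year=2020, month=6, day=30)   # build a date from its parts

print(days_between, follow_up, weekday_number, rebuilt)
```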

#### How can I check my data for normality in SPSS Statistics?

When you’re deciding which tests to run on your data it’s important to understand whether your data is normally distributed or not, as a lot of standard parametric tests assume a normal distribution whereas other non-parametric tests are designed to be run on data which is not normally distributed. A normal distribution has a number of characteristics:-

- It is symmetrical
- It is bell-shaped
- Its mean, median and mode all appear at the same place
- Normal distributions can be divided up into the same proportions by the standard deviations, so 95% of the area under the curve lies within roughly plus or minus two standard deviations of the mean

In this video Jarlath Quinn demonstrates how to use the functions within the explore command in SPSS Statistics to test for normality.
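The proportions mentioned above can be checked numerically. The short Python sketch below uses only the standard library; the helper names `normal_cdf` and `area_within` are our own. It does not test a dataset for normality, it only shows how much of a normal curve lies within a given number of standard deviations of the mean.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def area_within(k: float) -> float:
    """Share of a normal distribution lying within +/- k standard deviations."""
    return normal_cdf(k) - normal_cdf(-k)


for k in (1.0, 1.96, 2.0, 3.0):
    print(f"within +/-{k} SD: {area_within(k):.4f}")
```

The exact multiplier for 95% is about 1.96, which is why "roughly two" standard deviations is the usual rule of thumb.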

#### How can I recode my data in SPSS Statistics?

Recoding your data means changing the values of a variable so that they represent something else. Within SPSS Statistics there is more than one type of recode that can be performed. In this video Jarlath Quinn demonstrates how to:-

- Recode into the same variables, overwriting an existing variable
- Recode into different variables, creating a new variable in addition to your existing variables
- Automatically recode, a particular procedure designed to change string codes into numeric codes
- Visual binning, visualising a distribution in the form of a histogram and slicing it into ranged categories
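Two of these ideas, recoding into a different variable and automatic recoding of strings into numbers, can be sketched in plain Python. This is not SPSS syntax; the field names and data are invented, and numbering the sorted string values from 1 is an assumption made for the example.

```python
# Hypothetical data: a numeric satisfaction score and a string region code.
rows = [
    {"score": 1, "region": "north"},
    {"score": 4, "region": "south"},
    {"score": 5, "region": "north"},
]

# Recode into a different variable: "score" is kept and "score_group" is added.
for row in rows:
    row["score_group"] = "low" if row["score"] <= 2 else "high"

# Automatic recode: give each distinct string a consecutive integer code.
distinct_regions = sorted({row["region"] for row in rows})
codes = {label: number for number, label in enumerate(distinct_regions, start=1)}
for row in rows:
    row["region_code"] = codes[row["region"]]

print(codes)
print(rows)
```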

#### How can I create grouped or banded variables in SPSS Statistics?

SPSS users often want to be able to create grouped or banded data from continuous fields such as, for example, creating age groups or income bands from continuous fields. In this video Jarlath Quinn demonstrates how to use the visual binning procedure within SPSS Statistics to do this including how to control the proportion of cases that fall into each band and how to automatically create value labels.
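The underlying idea of banding, assigning each continuous value to a labelled range, can be sketched in Python as follows. The cut points and labels are hypothetical, and treating each cut point as the inclusive upper bound of its band is an assumption made for this example.

```python
from bisect import bisect_left

# Hypothetical cut points: the inclusive upper bound of each of the first three bands.
cut_points = [24, 44, 64]
labels = ["24 or under", "25-44", "45-64", "65 or over"]


def age_band(age: int) -> str:
    """Return the label of the band that an age falls into."""
    # bisect_left finds the first cut point that is >= age.
    return labels[bisect_left(cut_points, age)]


ages = [18, 24, 25, 44, 45, 70]
print([age_band(a) for a in ages])
```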

#### How can I merge files in SPSS Statistics?

In this video Jarlath Quinn demonstrates how to merge data files within SPSS Statistics using each of the two main methods, either adding cases (combining files with the same fields but additional rows) or adding variables (combining files by joining variables to a target file using something like an ID field as a ‘keyed variable’).
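The difference between the two methods can be shown with a small Python sketch. It is not SPSS syntax; the datasets and the `id` key are invented for illustration.

```python
# Add cases: both files have the same fields, so the rows are simply stacked.
wave_1 = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
wave_2 = [{"id": 3, "age": 28}]
all_cases = wave_1 + wave_2

# Add variables: extra fields are joined onto the target file using "id" as the key.
incomes = {1: 32000, 3: 41000}  # hypothetical lookup keyed by id
merged = [{**row, "income": incomes.get(row["id"])} for row in all_cases]

print(merged)
```

In this sketch a case with no match in the lookup (id 2) keeps its row and receives a missing value for the new field.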

#### How can I change the appearance of my output in SPSS Statistics?

We’re often asked how you can change the appearance of the tables that SPSS generates as output. In this video Jarlath Quinn demonstrates two different ways to do this, either by choosing a different table look in the edit / options function, or by editing the table properties directly yourself.

#### How can I select cases in SPSS Statistics?

In this video Jarlath Quinn demonstrates how to use SPSS Statistics to define data filters in order to select particular cases for analysis. This can be done either to create a temporary selection or to create a permanent new file with only a subsection of cases included within it. The video demonstrates how to do this with string variables too, as well as how to combine conditions from multiple variables in your selection.
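To make the temporary versus permanent distinction concrete, here is a minimal Python sketch (not SPSS syntax) with invented data. The `filter_flag` field and the `is_selected` function are hypothetical names.

```python
cases = [
    {"id": 1, "gender": "female", "age": 34},
    {"id": 2, "gender": "male", "age": 51},
    {"id": 3, "gender": "female", "age": 67},
]


def is_selected(row: dict) -> bool:
    # Combine a condition on a string variable with one on a numeric variable.
    return row["gender"] == "female" and row["age"] >= 40


# Temporary selection: flag the rows but keep the full dataset intact.
for row in cases:
    row["filter_flag"] = 1 if is_selected(row) else 0

# Permanent selection: a new dataset containing only the selected cases.
subset = [row for row in cases if row["filter_flag"] == 1]
print([row["id"] for row in subset])
```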

## Statistical techniques

#### How do I choose the correct statistical test?

When you’re conducting any kind of statistical analysis, it’s vital that you select the correct tests to perform, given the characteristics of your data and the analytics outcomes that you’re hoping for. If you don’t choose the right tests then the results you generate can be meaningless and this can lead to business decisions being made based on faulty analysis.

To help you understand which tests to use we’ve put together this simple infographic that outlines the key factors you will need to consider and the decisions you’ll need to make at each stage. Using this infographic should direct you to the correct test and maximise your chances of generating useful and accurate findings that help you make effective business decisions.

#### What’s the difference between the various types of statistical models?

The real world, whether it be the physical world, for example machines, or the natural world, for example human and animal behaviour, is very complex with many factors, some unknown, determining their behaviour and responses to interventions. Even if every contributory factor to a phenomenon is known, it is unrealistic to expect that the unique contribution of each factor to the phenomenon can be isolated and quantified. Thus, mathematical models are simplified representations of reality, but to be useful they must give realistic results and reveal meaningful insights.

In his 1976 paper ‘Science and Statistics’ in the *Journal of the American Statistical Association* George Box wrote ‘Since all models are wrong, the scientist cannot obtain a ‘correct’ one by excessive elaboration. On the contrary, following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist, so over-elaboration and over-parameterization is often the mark of mediocrity.’ In some ways, this is an extension of a famous saying by Einstein, ‘Everything should be made as simple as possible, but not simpler.’

The terms data mining, statistical modelling and predictive analytics are often used interchangeably whereas in fact they have different meanings, particularly data mining and statistical modelling. The aim of this blog post is to clarify these differences and to present a glossary of the following terms:

- Statistics and statistic
- Mathematical models
- Deterministic models
- Statistical models
- Data mining

## Statistics and statistic

The word statistics derives from *stato*, the Italian word for state. The original aim of statistics was the collection of information for and about the state.

The birth of statistics as we know it today was in the mid-17th century when John Graunt, a shopkeeper from London, began reviewing a weekly church publication issued by the local parish clerk that listed the number of births, christenings and deaths in each parish. The data were called Bills of Mortality and were published in a form that we now call descriptive statistics in *Natural and Political Observations Made upon the Bills of Mortality*. Graunt was later elected to the Royal Society.

**Statistics** can be used in the singular and the plural.

- The singular form is used when referring to the academic subject. For example, statistics is offered as a part of many university mathematics degrees. One definition of the subject statistics is ‘the science of collecting, organising and interpreting data’. The data analysed are usually a sample obtained from surveys, experiments or as a periodic snapshot, and the aim is to infer information about the population from which the sample was drawn. When the entire population rather than a sample from the population is analysed, the exercise is called a census.
- The plural form refers to at least two quantities. For example, the mean and standard deviation are two summary statistics for continuous data.

The latter form can be used in the singular, statistic. For example, the mean is the most frequently used statistic for the central tendency of continuous data.

## Mathematical models

A **mathematical model** is a description of a system using mathematical concepts, for example algebra, graphs, equations and functions, and language, for example arithmetic signs. **Mathematical modelling** is the process of developing mathematical models.

**Deterministic models and statistical models**

Mathematical models can be classified as either deterministic models or statistical models.

- A **deterministic** model is a mathematical model in which the output is determined only by the specified values of the input data and the initial conditions. This means that a given set of input data will always generate the same output.
- A **statistical** model is a mathematical model in which some or all of the input data have some randomness, for example as expressed by a probability distribution, so that for a given set of input data the output is not reproducible but is described by a probability distribution. The output ensemble is obtained by running the model a large number of times with a new input value sampled from the probability distribution each time the model is run. Statistical models can be run by using Monte Carlo simulation.

So, another definition of a statistical model is a mathematical description of a system that accounts for uncertainty in the system.
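The contrast between the two kinds of model can be sketched in a few lines of Python. The revenue example, the function names and all of the numbers below are invented for illustration, and the random seed is fixed only so that the example is repeatable.

```python
import random
from statistics import mean


def deterministic_revenue(units: int, price: float) -> float:
    """Deterministic model: the same inputs always give the same output."""
    return units * price


def simulate_revenue(mean_units, sd_units, price, runs, seed):
    """Monte Carlo sketch: demand is sampled from a normal distribution on every run."""
    rng = random.Random(seed)
    # Negative demand makes no sense, so each sampled value is floored at zero.
    return [max(rng.gauss(mean_units, sd_units), 0.0) * price for _ in range(runs)]


print(deterministic_revenue(100, 9.99))
outcomes = simulate_revenue(100, 15, 9.99, runs=10_000, seed=1)
print(round(mean(outcomes), 2), round(min(outcomes), 2), round(max(outcomes), 2))
```

The deterministic model returns a single number, while the simulation returns a whole distribution of outcomes that can be summarised by its mean and spread.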

**Statistical modelling** is the process of forming a hypothesis for a statistical model on a set of data, developing a model and then testing it on the data to see if the hypothesis is true.

## Data mining

**Data mining** is the process of analysing data to find *new* patterns and relationships in the data. In some ways, it is an ‘exploratory walk through the data’ without a particular objective in mind but with an open mind as to the patterns and trends in the data that will be revealed.

One of the main differences between data mining and statistical modelling is that data mining does not require a hypothesis but statistical modelling does require a hypothesis for the model. Thus, in statistical modelling a model is specified in advance but in data mining no relationships are specified.

**Some misconceptions about data mining**

Data mining models can be quite complex, and so users must be familiar with the models and equally important know their limitations to get maximum benefit from them. The black box nature of some data mining software makes data mining easy to be misused or used incorrectly, and this can lead to bad and very costly business decisions being taken. There are a few misconceptions about data mining that should be clarified.

**Misconception 1**: Data mining requires little or no human intervention.

**Response 1**: Data mining is a process, not an event, and data mining projects should be carried out using a structured and robust methodology, such as CRISP-DM.

With respect to the data preparation phase of CRISP-DM, data mining requires human intervention if only for the following simple and obvious reasons:

- Each set of data is unique and so has its own characteristics that determine how and to what extent they need to be prepared.
- Data preparation covers a very wide range of methods, and the particular methods used and the way they are applied depend on many things including the data (each field), the modelling methods to be used and the aims of the project.

Data mining and analytics software should not be used blindly, i.e. without understanding the modelling methods. If they are used by people who are not familiar with the methods, they will apply the wrong method to the wrong data and so will result in the wrong answer to the wrong question. The implications for business of using results generated using such an approach are self-evident.

**Misconception 2**: Data mining software packages are intuitive and easy to use.

**Response 2**: Data mining is concerned with analysis and modelling, not with IT. Data mining software is not ‘plug and play’ software and requires people with knowledge and experience of the models to use it so that the maximum commercial benefit is gained from it.

**Misconception 3**: Data mining can identify the causes of the problem.

**Response 3**: Data mining is much more likely to identify the causes of the problem if it is used with CRISP-DM or a similar methodology. Therefore, to identify the causes of the problem, a thorough understanding of the background and context of the problem, and of the data, is essential.

#### What is a chi-squared test and when would you use it?

Take a look at the table below. It describes a relatively common situation in business analytics. Two offers have been made to a sample of 40,000 prospective readers of a magazine. As an experiment, half of the prospects have been offered a 25% discount for the first year and the other half have been offered an extended subscription of 15 months (rather than the normal 12 months).

The table seems to indicate a slight increase in the response rate (a mere 0.4%) for those offered the extended subscription. The business analysts want to know how probable it is that this is simply a random effect or whether the extended subscription is indeed more likely to elicit a response. If by this stage you are thinking that a tiny difference of 0.4% really isn’t worth bothering about, you might want to consult this article.

In modern parlance, this type of problem is commonly referred to as **A/B testing**, but in reality the methods used to address it are over 100 years old. One such approach is to apply a test of statistical significance such as the **Pearson Chi-Squared test**. Generally speaking, tests like chi-squared are used to examine differences between fields whose values fall into distinct categories.

Without getting too deep into the technicalities of how it is calculated, the Chi-Squared value is derived by comparing the frequencies (counts) that we observe in a table with the *expected* frequencies that we would see if there was no bias towards one group or another. If we flip a coin 100 times and the outcome is 54 ‘heads’ and 46 ‘tails’, we may not instinctively be able to calculate the probability that the coin is biased, but we know that *if it wasn’t biased*, on average we should see a 50/50 outcome: this represents our expected frequency.
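To make the coin example concrete, the short Python sketch below works through the arithmetic for 54 heads and 46 tails against an expected 50/50 split. The function name `chi_squared_two_categories` is our own illustration rather than anything from SPSS, and the p-value shortcut it uses only holds when there are exactly two categories (one degree of freedom).

```python
import math


def chi_squared_two_categories(observed, expected):
    """Return (statistic, p_value) for observed vs expected counts in two categories.

    Illustrative helper (not an SPSS function). Assumes exactly two categories,
    so the test has 1 degree of freedom.
    """
    if len(observed) != 2 or len(expected) != 2:
        raise ValueError("this sketch only handles two categories")
    # Sum of (observed - expected)^2 / expected over the categories.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, the chi-squared upper-tail probability
    # equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Coin example from the text: 54 heads and 46 tails, 50/50 expected.
statistic, p_value = chi_squared_two_categories([54, 46], [50, 50])
print(f"chi-squared = {statistic:.2f}, p = {p_value:.3f}")
```

For these counts the statistic is 0.64 and the p-value is roughly 0.42, so an outcome of 54 heads gives no real evidence that the coin is biased.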

Let’s look at the table again, but this time we can compare the observed and the expected frequencies (which are shown in red).

We can see that the 25% discount offer elicited 39 fewer responses than would be expected if both offers had the same effect (whereas the exact opposite is true of the extended subscription offer). The Chi-Squared calculation squares each difference between the observed and expected counts, scales it by the expected count and sums the results. It then (with a few adjustments depending on the method used) calculates how probable it would be to see differences at least this large through random chance alone, if there were no real effect in the population of all potential magazine subscribers. In other words, it indicates whether the extended subscription offer is likely to be more tempting to prospects than the 25% discount offer. The table below shows the results of the Chi-Squared test.

The value that we need to focus on is in the row marked ‘Sig.’. This is an estimate of probability. In this instance, a probability value of 0.031 indicates that, if the two offers were really equally effective, we would expect to observe differences this large only 3.1% of the time. In short, the difference between the response rates for the two offers is not very likely to be the result of mere chance.

So how small does the probability have to be before you come to this conclusion? Well, that depends on the context of the analysis, but generally a value less than 0.05 (or 5%) is regarded as small enough to be viewed as ‘statistically significant’. In this case the business analysts could conclude that there is good evidence to suggest that the extended subscription offer is likely to yield slightly more customers than offering a 25% discount.
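The calculation described above can be sketched in a few lines of Python. Note that the response counts used here are illustrative assumptions, chosen only to be broadly consistent with the figures quoted in the text (20,000 prospects per offer and a gap of 39 responses from the expected count); they are not the figures from the original table. The function name is our own, and the sketch applies the plain Pearson calculation with no continuity correction, so a statistics package may report slightly different values depending on the method selected.

```python
import math


def pearson_chi_squared_2x2(table):
    """Return (expected, statistic, p_value) for a 2x2 table of counts.

    Illustrative helper. Rows are the two offers; columns are
    'responded' and 'did not respond'. No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    expected = []
    statistic = 0.0
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            # Count we would expect if the offer made no difference.
            e = row_totals[i] * col_totals[j] / grand_total
            expected_row.append(e)
            statistic += (observed - e) ** 2 / e
        expected.append(expected_row)

    # A 2x2 table has 1 degree of freedom, for which the upper-tail
    # probability is erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected, statistic, p_value


# HYPOTHETICAL counts, not the original table.
observed = [
    [641, 19359],  # 25% discount: responded, did not respond
    [719, 19281],  # extended subscription: responded, did not respond
]
expected, statistic, p_value = pearson_chi_squared_2x2(observed)
print(f"expected responses per offer = {expected[0][0]:.0f}")
print(f"chi-squared = {statistic:.2f}, p = {p_value:.3f}")
print("statistically significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

With these assumed counts the expected number of responses for each offer is 680 and the p-value comes out a little above 0.03, which falls below the conventional 0.05 threshold.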

#### What is correlation and when is it useful?

Correlation is a term that we employ in everyday speech to denote things that appear to have a mutual relationship. In the world of analytics, correlations are specific values that are calculated in order to quantify the relationships between variables. This kind of analysis is powerful because it allows us to measure the association between factors such as advertising spend and website hits, product sales and competitor pricing, Net Promoter Score and customer discount, or ambient temperature and component part failure.

Not only can we measure this relationship but we can also use one variable to predict the other. For example, if we know how much we’re planning to increase our spend on advertising then we can use the correlation to estimate what the increase in visitors to the website is likely to be. This is possible because, within certain limits, the strength of the relationship can be expressed as a specific number.
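As a rough illustration of using one variable to predict another, the Python sketch below fits a straight line by least squares (simple linear regression, the technique that usually sits behind this kind of prediction and that is closely tied to correlation). The advertising and visitor figures are made up for the example, and `fit_line` and `predict` are hypothetical helper names, not functions from any particular package.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # How x and y vary together, and how x varies on its own.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, new_x):
    """Estimate y for a new x value using the fitted line."""
    return intercept + slope * new_x


# HYPOTHETICAL data: monthly advertising spend (in thousands) and website visitors.
ad_spend = [10, 15, 20, 25, 30]
visitors = [2100, 2900, 4200, 4800, 6000]

slope, intercept = fit_line(ad_spend, visitors)
estimate = predict(slope, intercept, 22)
print(f"estimated visitors at a spend of 22: {estimate:.0f}")
```

An estimate like this is only sensible for spend levels within the range already observed, which is one of the limits referred to above.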

## Visualising correlation using scatterplots

The relationship between two variables can be visualised using scatterplots, as in the examples below.

- Scatterplot A shows the relationship between vehicle weight and horsepower
- Scatterplot B shows the relationship between vehicle miles per gallon and the time it takes to accelerate from 0-60mph
- Scatterplot C shows the relationship between vehicle horsepower and time taken to accelerate from 0-60mph

Graph A shows a strong positive relationship between the horsepower of various cars and their weight. Graph B also shows a positive relationship (although not as strong as in graph A) between the time taken to accelerate to 60mph and the car’s fuel economy in miles per gallon. Finally, graph C shows a strong *negative* relationship between horsepower and time taken to accelerate to 60mph (in other words, less powerful cars accelerate more slowly).

## Quantifying a relationship using the correlation coefficient

Although scatterplots help us to *visualise* relationships like this, they don’t allow us to *quantify* the pattern. This is where a correlation coefficient comes in handy. Using Pearson’s correlation coefficient (sometimes denoted as Pearson’s *r*) we can measure the strength of the linear (i.e. straight line) relationship within each graph. Pearson’s correlation coefficient is a number that runs from -1 to +1 (for a more technical explanation click here).

Values approaching either of these two numerical limits indicate stronger linear relationships, whereas values closer to 0 indicate weaker relationships (or no relationship at all). A positive correlation value means that the variables concerned increase or decrease in parallel – as one increases or decreases so does the other – whereas a negative correlation value indicates that as one variable increases the other decreases, or vice versa. Let’s look at the graphs again but this time we will reveal their correlation values.

We can see that graph A, with a correlation of 0.86, indicates a much stronger relationship between horsepower and weight than graph B, whose correlation of 0.43 measures the relationship between acceleration and mpg. Graph C also indicates a strong linear relationship, but here the value is negative (-0.7), meaning that as horsepower *increases*, acceleration time *decreases*. If the relationships between these factors were much weaker, then we would expect to see correlations with much smaller magnitudes such as 0.2 or 0.1.
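For readers who want to see how such a coefficient is arrived at, the Python sketch below computes Pearson’s *r* from first principles. The horsepower and acceleration figures are invented for illustration and are not the data behind the graphs; `pearson_r` is our own helper name.

```python
import math


def pearson_r(x, y):
    """Return Pearson's correlation coefficient for two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # How the two variables vary together around their means.
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # How much each variable varies on its own.
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    if variation_x == 0 or variation_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return co_variation / math.sqrt(variation_x * variation_y)


# HYPOTHETICAL data: horsepower and seconds taken to reach 60mph.
horsepower = [70, 90, 110, 150, 200]
seconds_to_60 = [19.0, 16.5, 15.0, 12.0, 9.5]

r = pearson_r(horsepower, seconds_to_60)
print(f"r = {r:.2f}")  # negative: more horsepower, less time to reach 60mph
```

Because the result is scaled by the variation in each variable, it always falls between -1 and +1 regardless of the units in which the variables are measured.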

Using exactly the same approach we can measure and *compare* the relationship between initial spend and tenure, or between advertising and new registrations, or repeat visits and waiting times. We’ve all heard the adage that “correlation does not imply causation”. That makes sense because there is almost certainly a correlation between ice cream sales and incidents of drowning, but it’s most likely explained by a third factor, i.e. the fact that ice cream sales are higher in summer, when a lot more people go swimming. However, even though correlation doesn’t imply causation, very often the fact that we can measure the strength of a relationship and establish its existence in the first place can greatly enhance our decision making.