How to solve the most common data cleaning problems in SPSS Statistics

Cleaning and formatting data for analysis purposes is a perpetual problem for any researcher. Indeed, one of the most challenging aspects of analysing data is that often you may not spot issues with the data until you’re already well into a project. For users of SPSS Statistics, the software includes a number of procedures that are specifically designed to deal with multiple data cleaning issues. As an added bonus, pretty much everything can be recorded as syntax, so for those who need to prepare and analyse similar data extracts on a regular basis, this can all be automated.

Here are a list of some of the most common data cleaning problems that you are likely to encounter and which SPSS procedures help to resolve them. To my mind, data cleaning issues fall into two broad classes: firstly, problems with how the data has been formatted and secondly, problems with the data itself.

Data Format Problems

When importing data from third-party platforms the data are often formatted in ways that are sub-optimal for SPSS. Even something as simple as a variable name can cause headaches, as SPSS tends to be quite proscriptive regarding the inclusion of things like special characters or field names that start with a number. Users can of course rectify issues like this by manually renaming the fields within the Variable View tab of the Data Editor window, but it’s worth knowing that the syntax command RENAME VARIABLES allows you to instantly rename a list of multiple variables in one go. In a similar vein, we can also assign or change multiple labels through the VALUE LABELS and VARIABLE LABELS commands.

Often a dataset will contain special characters that indicate if a group or value should be excluded from the analysis process. Sometimes the easiest way to deal with these data points is to assign them as missing values. In SPSS, missing values can be applied to individual values and well as ranges. They may also be assigned via syntax.

Another common issue relates to date/time fields. Data from third party survey platforms as well as text files, spreadsheets, database tables occasionally export date fields in formats that SPSS may not immediately recognise as a time-based variable. Fortunately, SPSS has its own dedicated Date and Time Wizard that allows the users to create date/time variables from fields where date/time data have been encoded as strings. It also allows users to create date/time variables from variables holding parts of date or time fields.

Data Format Problem	Solution
Edit multiple variables names	Use the RENAME VARIABLES syntax command
Edit multiple variables labels	Use the VALUE LABELS or VARIABLE LABELS syntax command
Ignore non-legitimate values	Use the MISSING VALUES syntax command
Edit the level of measurement for multiple variables	Use the VARIABLE LEVEL syntax command
Deal with Date/Time fields	Click Transform > Date and Time Wizard
Convert string variables to numeric	Click Transform > Automatic Recode

Problems with data values

The other class of problem requiring data cleaning, relates to issues with the data values themselves. In these situations, the analyst might want to take care of extreme values, data entry errors, inconsistencies, duplicate records or illogical relationships in the file. In SPSS, one of the most valuable tools for dealing with these problems is the Validate Data procedure. This is acts as a one-stop-shop for some of the most common data quality and cleaning issues that users are likely to encounter. The Validate Data procedure creates both viewer reports and fields indicating when a record or variable fails to meet a pre-specified quality threshold. By default, the procedure will flag fields containing too many missing values, insufficient variation, too many cases in a single category, too many categories with a count of 1, incomplete IDs or cases with duplicate IDs.

It also allows users to create their own single-variable rules that check whether values fall outside an accepted range (e.g., ensuring that a sample does not contain respondents who are recorded as too young or too old). These rules can be saved and shared with other users. It even comes with a set of pre-built rules that ensure variables contain legitimate UK postcodes or US State names.

The Validate Data dialog: Single-Variable Rules tab

The procedure also allows users to define their own cross-variable rules to check that the values in a given case make logical sense and do not contradict each other (e.g., discharge date recorded as prior to admission date). On top of this, SPSS contains a portfolio of other procedures that help to address data cleaning issues such as the Recode command for creating more consistent or simpler categorisations within variables, the Identify Duplicate Cases procedure which affords even more control over what constitutes a duplicate record and how the issue should be resolved, the ability to identify unusual cases and an entire suite of tools dedicated to the analysis and imputation of missing values.

Data Value Problem	Solution
Detect common data quality issues	Click Data > Validation > Validate Data
Define data validation rules	Click Data > Validation > Define Rules
Create new variable categorisations	Click Transform > Recode into Different Variables
Replace missing values in a series	Click Transform > Replace Missing Values
Identify outliers	Click Analyze > Descriptive Statistics > Explore > Statistics > Outliers
Identify duplicate cases	Click Data > Identify Duplicate Cases
Identify unusual cases	Click Data > Identify Unusual Cases
Analyse missing values	Click Analyze > Multiple Imputation > Analyze Patterns
Impute missing values	Click Analyze > Multiple Imputation > Impute Missing Data Values

How to solve the most common data cleaning problems in SPSS Statistics

Data Format Problems

Problems with data values

About The Author

Jarlath Quinn

Contact us