Introduction to the data audit node in SPSS Modeler

This is the first in a regular series of videos about SPSS Modeler, designed to help you better understand some of the functions that are available within the package. If you’re an experienced user or you have been on one of our training courses then you’ll probably already be familiar with most of these, but if you’re a new user, you’re self-taught, or you’re currently evaluating the software then there are likely to be a number of things in these videos that you’ll find helpful.

SPSS Modeler data audit node – the Swiss army knife of data cleaning

The data audit node is a powerful tool you can use to help understand the shape and structure of your data before your analysis begins. You can also make some decisions here regarding how you might want to clean up your data, for example by dealing with missing values or extremes and outliers. In summary, the data audit node, as the name suggests, gives you an audit of your data – an overview of each of the variables. Included within this node are some charts to help you visualise your data and some summary statistics such as minimum and maximum values, mean, standard deviation and so on (where relevant).

Understanding the variables in your data

When you’re looking at a dataset for the first time it can be useful to get a quick overview of each variable. In particular it’s useful to be able to see at a glance how many valid cases there are for each variable, excluding any missing values. If you have categorical fields in your data you can also see how many unique categories there are in each variable. This feature can be particularly useful if you’re reading in a lot of data and discover that you have variables that only have one value in them – something that’s surprisingly common. You’d generally want to get rid of these variables as you’re cleaning up the data, because they’re not going to add anything useful to your analysis. The data audit node, therefore, offers a quick and easy way of identifying them.
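To make the idea concrete, here is a minimal sketch in plain Python of the kind of per-field overview described above. It is not Modeler scripting and does not reproduce how the data audit node works internally. The function names, the sample records and the rule that `None` and empty strings count as missing are all assumptions made for illustration.

```python
def is_missing(value):
    """Assumption for this sketch: None and empty/whitespace-only strings are missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def audit_fields(rows):
    """Return, for each field, the valid count, unique count and a single-value flag."""
    # Collect field names in the order they first appear.
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)

    summary = {}
    for name in names:
        valid = [row[name] for row in rows if name in row and not is_missing(row[name])]
        unique = len(set(valid))
        summary[name] = {
            "valid": len(valid),           # cases excluding missing values
            "unique": unique,              # number of distinct values
            "single_valued": unique == 1,  # candidate for removal
        }
    return summary


# Hypothetical sample data, invented for this example.
rows = [
    {"age": 34, "region": "North", "country": "UK"},
    {"age": None, "region": "South", "country": "UK"},
    {"age": 51, "region": "", "country": "UK"},
    {"age": 29, "region": "North", "country": "UK"},
]

summary = audit_fields(rows)
droppable = [name for name, stats in summary.items() if stats["single_valued"]]
print(droppable)
```

In this sample the `country` field holds the same value in every record, so it is the one flagged as a candidate for removal.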

Handling missing values, extremes and outliers

The data audit node includes a ‘quality’ tab which gives you an insight into other aspects of data quality. There are a number of things that you can do here to deal with unusual or non-legitimate cases. For example, the report identifies outliers and extremes in your data, using settings that you can override with your own thresholds depending on the requirements of your analysis. Here you can also tell Modeler how you’d like it to deal with any outliers or extreme values, and Modeler offers a great deal of flexibility. For example, you can ‘coerce’ the extremes or outliers, that is, force them to take a legitimate value. Alternatively you might decide to discard them or perhaps to nullify them. Modeler also lets you treat outliers and extremes differently, for example by coercing extremes and nullifying outliers.
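The three treatments described above (coerce, discard, nullify) can be sketched in plain Python as follows. This is an illustration only, not Modeler's own implementation. It assumes outliers and extremes are defined by distance from a centre in units of spread, with thresholds of 3 and 5 chosen for the example, and it assumes that coercing means clamping a value to the nearest outlier boundary. The function names and the `actions` dictionary are hypothetical. In practice the centre and spread would typically be the field's mean and standard deviation; they are passed in explicitly here to keep the arithmetic easy to follow.

```python
def classify(value, centre, spread, outlier_sd=3.0, extreme_sd=5.0):
    """Label a value 'normal', 'outlier' or 'extreme' by its distance from the centre."""
    distance = abs(value - centre)
    if distance > extreme_sd * spread:
        return "extreme"
    if distance > outlier_sd * spread:
        return "outlier"
    return "normal"


def treat(values, centre, spread, actions, outlier_sd=3.0, extreme_sd=5.0):
    """Apply a per-kind action ('coerce', 'discard', 'nullify' or 'keep') to each value."""
    # Assumption: coerced values are clamped to the outlier boundaries.
    lower = centre - outlier_sd * spread
    upper = centre + outlier_sd * spread
    treated = []
    for value in values:
        kind = classify(value, centre, spread, outlier_sd, extreme_sd)
        action = actions.get(kind, "keep")
        if action == "discard":
            continue  # drop the value entirely
        if action == "nullify":
            treated.append(None)  # replace with a null
        elif action == "coerce":
            treated.append(min(max(value, lower), upper))
        else:
            treated.append(value)
    return treated


# Hypothetical values with an assumed centre of 50 and spread of 10.
values = [48, 52, 85, 120, 50]

# Treat the two kinds differently: nullify outliers, coerce extremes.
mixed = treat(values, 50, 10, {"outlier": "nullify", "extreme": "coerce"})

# Or simply discard both.
kept = treat(values, 50, 10, {"outlier": "discard", "extreme": "discard"})
print(mixed, kept)
```

Here 85 is more than 3 but fewer than 5 spreads from the centre, so it counts as an outlier, while 120 counts as an extreme. Changing the two threshold arguments mirrors the idea of overriding the settings to suit your analysis.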

When it comes to missing data, the quality report tells you what percentage of your fields are complete and what percentage have some missing data, as well as telling you what kind of missing data you have: null values, blanks or empty strings. Again, you have numerous options regarding how you want to treat different kinds of missing data in your dataset, from specifying a value (such as the mid-range, the mean or a constant value) to choosing a random value or setting up an expression to calculate a value depending on different factors. You can even use an algorithm which will attempt to predict the value of the missing data, and set conditions here to determine when this algorithm is applied.
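As a rough illustration of the simpler options above, the following plain Python sketch classifies the kind of missing value and fills gaps with the mean, the mid-range or a constant. It is not Modeler code. The function names are hypothetical, as is the `blank_codes` parameter, which stands in for values that you have declared to be blanks (here an invented code of -99). The random, expression-based and algorithmic options are left out to keep the example short.

```python
def missing_kind(value, blank_codes=()):
    """Return 'null', 'empty string' or 'blank' for a missing value, else None."""
    if value is None:
        return "null"
    if isinstance(value, str) and value == "":
        return "empty string"
    if value in blank_codes:  # assumption: caller-declared blank codes
        return "blank"
    return None


def percent_complete(values, blank_codes=()):
    """Percentage of values that are not missing."""
    valid = sum(1 for v in values if missing_kind(v, blank_codes) is None)
    return 100.0 * valid / len(values)


def impute(values, method="mean", constant=None, blank_codes=()):
    """Replace missing values with the mean, mid-range or a constant."""
    valid = [v for v in values if missing_kind(v, blank_codes) is None]
    if method == "mean":
        fill = sum(valid) / len(valid)
    elif method == "midrange":
        fill = (min(valid) + max(valid)) / 2
    elif method == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if missing_kind(v, blank_codes) is not None else v for v in values]


# Hypothetical field with one null and one declared blank (-99).
income = [20, None, 30, -99, 70]

complete = percent_complete(income, blank_codes=(-99,))
by_mean = impute(income, "mean", blank_codes=(-99,))
by_midrange = impute(income, "midrange", blank_codes=(-99,))
print(complete, by_mean, by_midrange)
```

The mean and the mid-range give different fill values whenever the valid values are not symmetric around their centre, which is one reason to look at the distribution before choosing a method.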

In short, the data audit node is a great place to start your analysis and offers you numerous functions that can help you better understand your data and clean it up before the analysis starts. Take a look at the video demo to see how all these functions work in practice.

Sign up for our email newsletter

I hope you’ve found this video useful. If you’d like to be notified when future videos in this series are released then why not sign up for our email newsletter? It’s full of advice, hints, tips and news for SPSS users. We’ll never share your email with anyone else and you can unsubscribe at any time.

About The Author

Jarlath Quinn
