Using text analytics to get value from unstructured data


What is unstructured data?

The data you have access to within your organisation can be broadly sorted into two categories: structured and unstructured. Structured data is data that fits neatly into the fields and columns of relational databases or spreadsheets. Examples might include things like gender, dates, addresses, names, credit card numbers, customer satisfaction scores and so on. Structured data can be easily understood by computers, and analytical software tools such as SPSS Statistics are primarily designed to work with this kind of data.

Unstructured data, on the other hand, is much harder to quantify and does not readily fit into the structured format of spreadsheets and databases. It is qualitative data such as text, images, video and audio files, and it cannot be easily analysed using conventional data analysis tools. As a result, many organisations don’t bother trying to analyse unstructured data at all.

This is a mistake because there’s a huge amount of valuable information embedded in unstructured data and organisations these days tend to have massive volumes of it. Indeed, unstructured data volumes have been growing significantly over the years. Think about things like social media posts, Amazon reviews, customer complaint forms, thank you letters, emails from clients, free text fields on forms – all of this is unstructured data. Imagine if you could unlock the insights hidden collectively in this data. This is where text analytics comes in. Text analytics tools such as those that come with SPSS Modeler enable you to do exactly that.

What is text analytics?

The process of text analytics involves assigning portions of text to themed categories. Either these categories can be predetermined by the researcher before the analysis starts or they can emerge from the data as the analysis progresses. Traditionally text analytics was done manually by human coders working through data and assigning cases to categories ‘by hand’, significantly limiting the volumes of data that could be meaningfully managed in this way.
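To make the idea of predetermined categories concrete, here is a minimal Python sketch of the simplest possible approach: matching each comment against a keyword list per category. The category names, keywords and sample comments are all made up for illustration, and real text analytics tools go well beyond keyword matching.

```python
import re

# Hypothetical categories and keywords, chosen purely for illustration.
CATEGORIES = {
    "delivery": {"late", "delivery", "courier", "arrived"},
    "billing": {"invoice", "charged", "refund", "bill"},
    "service": {"rude", "helpful", "staff", "support"},
}


def assign_categories(text):
    """Return the sorted list of categories whose keywords appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(name for name, keywords in CATEGORIES.items() if words & keywords)


# Made-up example comments.
comments = [
    "My parcel arrived late and the courier was rude.",
    "I was charged twice and still have no refund.",
    "Lovely product, thank you!",
]

for comment in comments:
    print(assign_categories(comment) or ["uncategorised"], "-", comment)
```

A comment can fall into more than one category, and comments that match nothing are left uncategorised, which is roughly the decision a human coder makes for each case.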

How have technological developments enabled more sophisticated text analytics?

Technological developments have affected the practice of text analysis in two main ways. Firstly, huge volumes of data are now available to researchers for the purposes of text analysis. In particular, the internet and social media have led to a huge proliferation of text-based data, most of which is relatively easily accessible to text analysts. At the same time, computing power has grown, enabling the automated text analysis of massive volumes of data to be performed extremely quickly. Analysis which would once have required access to a mainframe or supercomputer can now be performed using sophisticated content analysis software on a user’s desktop.

Additionally, the sophistication of computer text analysis applications has grown massively. Early text analytics applications did little more than simply count the occurrences of particular words in a portion of text but much more sophisticated analysis than this is possible now. For example, SPSS’s Text Analytics module uses natural language processing, can handle extremely large datasets and enables integration between structured and unstructured data. It goes far beyond merely counting words and instead uses an understanding of sentence structure, context and meaning in order to group concepts intelligently as well as identifying mentions of entities such as people, places and organisations.
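For comparison, the word counting that early applications relied on can be sketched in a few lines of Python. This is only an illustration of the basic technique, not how any particular product works; the stop-word list and the sample review are assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "and", "is", "was", "to", "of", "it", "i", "my"}


def word_counts(text):
    """Count how often each non-stop word occurs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


# Made-up review text.
review = "The battery is great. Battery life was not great on my old phone."
counts = word_counts(review)
print(counts.most_common(3))
```

Notice that "great" is counted twice even though one of the mentions is "not great". A plain word count cannot see that difference, which is why an understanding of sentence structure and context matters.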

The results of text analytics can then be used as part of a statistical model. The researcher first builds a core predictive model. She can then code her text data in Text Analytics (or allow Text Analytics to do this automatically) and use the results of that analysis to create variables, which can be fed back into the initial model. The model can be run both with and without the text variables to see how they affect its predictive power.
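As a rough sketch of what creating variables from text can look like, the Python below turns a free-text comment into 0/1 columns that sit alongside the structured fields. The records, column names and concept keywords are hypothetical, and keyword matching stands in for the natural language processing a real tool would use.

```python
import re

# Hypothetical structured records with a free-text field attached.
customers = [
    {"id": 1, "tenure_months": 24, "comment": "Great support, very helpful staff."},
    {"id": 2, "tenure_months": 3, "comment": "Charged twice, want a refund."},
    {"id": 3, "tenure_months": 12, "comment": ""},
]

# Hypothetical concepts and keywords, for illustration only.
CONCEPTS = {
    "mentions_billing": {"charged", "refund", "invoice"},
    "mentions_service": {"support", "staff", "helpful"},
}


def add_text_variables(record):
    """Return a copy of the record with one 0/1 column per text concept."""
    words = set(re.findall(r"[a-z']+", record["comment"].lower()))
    row = {key: value for key, value in record.items() if key != "comment"}
    for column, keywords in CONCEPTS.items():
        row[column] = int(bool(words & keywords))
    return row


rows = [add_text_variables(customer) for customer in customers]
for row in rows:
    print(row)
```

The resulting rows contain only numeric fields, so the same modelling step can be run once with the `mentions_` columns and once without them to compare the two versions of the model.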

What are the benefits of text analytics?

Text analysis, particularly of social media posts or online product reviews, offers some important benefits to researchers. Firstly, it is unobtrusive. By that I mean that it generally uses data that is generated by people living their real lives, behaving as they normally would, rather than data that’s specifically generated for research purposes. For example, if you conduct a content analysis of all the tweets that mention your brand name, you are building your understanding of what your customers really think, potentially more so than if you did a customer satisfaction survey. People do not always tell the truth to interviewers or researchers. However, the chance that their behaviour on Twitter is influenced by the fact that researchers may access their tweets at some future point is minimal.

An additional benefit of text analytics, particularly using social media data such as tweets, is the immediacy of the data collection. Tweets meeting predetermined criteria can be collected virtually in real time and the analysis can begin straight away. Relatively little data preparation is required compared to that which would be needed for analysis of other kinds of content. The tweets are collected complete and already in digital form. They do not require transcription or digitisation, which simplifies the process considerably (and reduces the cost) compared to more traditional content analysis of, for example, complaint letters. A well-known mobile network provider discovered that real-time text analysis of tweets mentioning its brand name could be used to alert it to faults and outages much more quickly than waiting for its standard procedures to kick in.
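The kind of alerting described above could, in its very simplest form, look something like the Python sketch below. It is not the provider’s actual method: the fault phrases, the threshold and the sample posts are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical phrases that suggest a fault, and a hypothetical threshold.
FAULT_PHRASES = ("outage", "no signal", "network down")
ALERT_THRESHOLD = 3


def minutes_to_alert(posts, threshold=ALERT_THRESHOLD):
    """Return the minutes in which fault-related posts reached the threshold."""
    per_minute = Counter()
    for timestamp, text in posts:
        if any(phrase in text.lower() for phrase in FAULT_PHRASES):
            # Truncate the timestamp to the minute so posts can be grouped.
            per_minute[timestamp.replace(second=0, microsecond=0)] += 1
    return sorted(minute for minute, count in per_minute.items() if count >= threshold)


# Made-up posts as (timestamp, text) pairs.
posts = [
    (datetime(2024, 1, 1, 9, 0, 5), "No signal in Leeds again"),
    (datetime(2024, 1, 1, 9, 0, 20), "Is there an outage?"),
    (datetime(2024, 1, 1, 9, 0, 48), "Network down for me too"),
    (datetime(2024, 1, 1, 9, 1, 10), "Loving my new phone"),
]

print(minutes_to_alert(posts))
```

Here three fault-related posts arrive within the same minute, so that minute is flagged, while the unrelated post is ignored.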

Finally, text analytics can unlock important insights from your unstructured data that can otherwise go unnoticed. Take complaints data as an example. Typically, organisations manage complaints individually as they come in rather than analysing them as a whole. But taking all complaints collectively and using them as a dataset for text analytics can unlock important insights about the types and nature of complaints that come up most frequently – insights that can easily be missed when assessing complaints one by one.
