What skills are needed to successfully run data science projects in the cloud?

In my most recent blog post I wrote about the benefits and opportunities that cloud computing brings to data science projects and, reciprocally, how data science projects provide a great way to demonstrate the value of cloud computing. In this post I want to discuss the cloud computing skills challenge and provide some advice on how best to ensure you have the skills needed to capitalise on the benefits of the cloud in your organisation.

The ongoing development of cloud computing has brought with it a changing skills demand. Without investment in the right skills, organisations will struggle to realise the benefits of both cloud computing in general and data science more specifically. So how do you ensure your organisation has the skills that it needs?

1. Welcome back to the command prompt…

If we could wind the clock back to the late 1980s and early 1990s and sit in front of a personal computer (or a mainframe or UNIX terminal for that matter) of that era, we would likely be presented with a blinking command prompt. Whether you were using a word processing application such as WordPerfect, analysing data using SPSS or SAS, or looking for saved files on your computer’s hard drive, typing commands was the only interface available to users.

Computer users today are used to an intuitive graphical user interface, so going back to the command line can be a daunting prospect. ‘How do I make this computer do useful stuff?’ would be an understandable response. One of the challenges of today’s cloud applications, especially at the database, data ingestion and sometimes the modelling level, is that they often require a user to be comfortable using code.

If you have been around technology as a business user for more than about 20 years, this will feel like going full circle. Whilst the cloud-based applications that enable the full cycle of data science are evolving rapidly, many still require a degree of scripting and coding. These scripting and coding languages can be learnt of course, but that requires an investment of time (and potentially money) in training, which is an overhead for the organisation and its users. However, if you want to make the most of the cloud then you will need to ensure that you or your organisation has access to these capabilities.

2. You may not be able to bring your old familiar tools with you…

Closely related to the above point, many of the on-premise, locally installed applications that analysts and data scientists are used to have not been magically ported to cloud environments. This means that there is likely to be some new learning to do, as data science practitioners who have perfected their craft on one set of analytical tools may need to invest in becoming familiar with a new tool kit. This can feel daunting.

That said, there is more encouraging news. Many technology vendors who are now investing heavily in data science tooling in the cloud already have a long history of developing excellent data science software. These vendors are evolving their cloud applications to incorporate the very best of the interfaces from their classic data science products. A good example of this is IBM. IBM acquired SPSS, the market-leading data science software vendor, back in 2009 and has continued to develop and invest in both SPSS Statistics and SPSS Modeler. Many of the great interface features that were present in the on-premise products are now being re-interpreted and delivered as part of the contemporary suite of cloud data science tools such as Watson Studio. This is a major boon for experienced data scientists, and it provides a significant boost to productivity for individual data science practitioners and teams migrating to a cloud environment.

The other piece of good news is that the data science skills and understanding that experienced users have built up, in terms of methodology and approach, still carry just as much value. Data science is an applied discipline with real-world, practical outcomes; it is never just about the technology and software tools.

3. What did the IT department ever do for data scientists anyway?

The arrival of cloud computing has fundamentally changed both what IT teams need to do and the way they work with the communities of technology users that they support. In short, it has made the users and IT teams more interdependent. This is especially true of the relationship between the data science team and IT.

Before cloud computing, it was not unusual to see data scientists actively avoiding engaging with their IT department, viewing them as a delivery bottleneck or an area of the business that inadvertently suffocated innovation. It was common to speak with data science teams, especially in larger organisations, that had created data infrastructure and data storage with connected analytical tools that were not on the radar of the IT team.

Whilst perhaps understandable, this ‘off radar’ approach, for a mainstream data-driven decision support function such as data science, is not good for the organisation and creates all manner of potential risks. Cloud computing has changed that. IT is now a key enabler of secure, scalable, on-demand computing infrastructure via the cloud, and can add a full suite of data science applications that will keep pace with the demands of the data scientists.

Moreover, a well-considered cloud environment will provide appropriate, reliable resources for more established and critical analytical and predictive modelling processes, whilst also allowing the more innovative ‘skunk works’ projects to be undertaken by the data science team. IT is no longer the bottleneck. Instead, in the cloud computing era, IT is a key partner and enabler for data science. This is a skills consideration because data science in the cloud hinges even more strongly on the new cloud infrastructure, and IT holds the keys to this. The ability to communicate, collaborate, plan and engage with colleagues in IT is more important than ever.

4. Cloud computing changes everything and nothing!

One of the most appealing aspects of being involved with data science is how applied it is. By that I mean that if the output of your data science activity is not constantly challenging and changing organisational behaviour, then you are doing it wrong! The two biggest and most common failure points of data science projects are:

A failure to properly define and document the question that you are trying to answer, which usually includes a failure to think hard enough about what success actually means and how you will measure it.
A failure to plan carefully enough how you will deploy (use) the results. This really means a failure to push through the required change of behaviour that will deliver improved outcomes.

At their core, neither of these failure points is a technology problem. They are methodology and approach challenges (and dare I say it, organisational culture challenges).

The way that we avoid these pitfalls is to use a suitable data science / data mining methodology. A suitable methodology gives all the collaborators a common language and provides a checklist to ensure you are not missing key steps and checkpoints along the way. At Smart Vision we favour the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology.

CRISP-DM has been around a long time and it has stood the test of time. Other equally good approaches are out there and are variations on the theme, but having an underpinning approach and methodology will save time and minimise the number of projects that do not deliver useful outcomes. This is true regardless of the enabling technology. Strong project management skills, with a specialisation in data science, are key requirements.
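To make the checklist idea concrete, here is a minimal sketch in Python, assuming a team simply records which phases it has signed off. The six phase names are the standard CRISP-DM ones; the function name `outstanding_phases` and the example project state are hypothetical and purely illustrative.

```python
# The six standard CRISP-DM phases, in their usual order.
CRISP_DM_PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]


def outstanding_phases(signed_off):
    """Return the phases not yet signed off, in CRISP-DM order.

    `signed_off` is a hypothetical set of phase names the team has completed.
    """
    return [phase for phase in CRISP_DM_PHASES if phase not in signed_off]


if __name__ == "__main__":
    # Illustrative project state, not real data.
    done = {"Business Understanding", "Data Understanding"}
    for phase in outstanding_phases(done):
        print("Still to do:", phase)
```

Even something this simple keeps the two common failure points in view: the question is defined first, under Business Understanding, and Deployment is planned for rather than left as an afterthought.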

5. Overlook the subject matter experts (SMEs) at your peril…

Data science is a technical and scientific discipline but, as touched on above, it is also a highly applied activity. It is of most value to organisations that do other things as their core activity, e.g. sell financial services products, campaign and raise funds for good causes, run online and high street retail businesses, or manufacture food and other consumables. You name it, almost every business that operates at scale will benefit from data science, i.e. the act of analysing and modelling the data that it produces, captures and stores (about its customers, processes and products) to inform decision making.

Inherent in these activities is detailed knowledge and understanding of that business. In our long experience, the most successful data science practices are those that have ensured the involvement of the subject matter experts in the business. These may not be the folks doing the actual analysis (although sometimes this is the case) but they must be part of the process and methodology. Involving the right SME as part of any project will help avoid blind alleys, provide invaluable guidance on what outputs and outcomes are most useful, inform the data scientists when the approach being taken may be impractical in day-to-day operations and help avoid projects whose best outcome is a re-statement of the obvious.

In summary, cloud computing does provide a challenge in relation to technical skills and know-how. This challenge can be overcome through investment in people and training. Alongside that, many of the established practices of project methodology and leveraging subject matter expertise are unchanged. It is easy to forget this in the excitement and hubris that can sometimes accompany the white heat of technological innovation and change.

About The Author

Berni Simmons