Which data science tools should you learn?

I’ve blogged several times about different aspects of data science. A conversation I’ve been having more and more frequently is about which tools people should learn if they’re hoping to develop a career in data science. There are several factors to take into account here.

You’ll want to think about whether there’s a tool that’s the standard in your particular industry. You’ll also want to consider whether you want to specialize in a particular area of data science and build a reputation as an expert in a range of related tools, or whether you’d prefer to work as a generalist with expertise across a wider range of different kinds of tools. Then of course there’s the question of money. Which tools are most richly rewarded in terms of salary?

With that in mind I recently came across this article on Tech Republic which addresses that very question. It’s a couple of years old now but in my experience the core arguments that it puts forward still apply. The key points are as follows:

  • Analysts with skills in open source tools tend to be paid more than those who can only use proprietary commercial tools.
  • The most commonly used tools are SQL, Excel, R and Python – O’Reilly’s 2014 data science salary survey found that each of these four tools was used by over 50% of its sample of respondents.
  • The more tools you can use, the higher the salary you’ll be able to command. Data scientists familiar with between one and five tools had a mean salary of just over $70,000 – this doubles for those who can use more than 20 tools.
  • Broadly speaking the tools that data scientists use are split into two clusters – a Microsoft / Excel / SQL cluster and a Hadoop / Python / R cluster. It’s unusual for people to only be able to use tools from one of these clusters, but there is a clear tendency for people to specialize in one group or the other. The tools in the Hadoop cluster all tend to be open source whilst those in the Microsoft cluster are much more likely to be proprietary tools.
  • The other distinction between the two clusters is that the tools in the Hadoop cluster tend to be those that enable analysts to get to grips with very large datasets. As the cost of collecting data falls and the computing power required to analyse it becomes more widely available, this has a knock-on effect on the skills required. Companies are collecting massive volumes of data as a matter of course now, but they’re struggling to make sense of it. Data scientists with the skills required to work with huge datasets have a clear advantage in the market.
  • Those data scientists who specialize in tools from the Hadoop cluster tend to be more highly paid than those whose skills are focused on the Microsoft cluster. However, that’s also influenced by the fact that people who use one tool in the Hadoop cluster are much more likely to be able to use several of those tools. Those data scientists who use the highest number of tools are much more likely to be focused on the Hadoop cluster rather than on the Microsoft cluster.

It seems that the highest paying jobs tend to be those that require high-level open source expertise, particularly using tools such as Hadoop and R. It’s also clear that if you’re thinking of learning one of these tools with a view to improving your employability and your earning power then it’s probably worth learning more than one. The correlation between number of tools and salary is clear.
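
If you have salary survey data of your own, a relationship like this is easy to check for yourself. What follows is only a rough sketch in Python: the rows are invented purely for illustration and are not figures from the O’Reilly survey, and the band edges and function names are my own assumptions rather than anything taken from the article.

```python
from math import sqrt
from statistics import mean

# Hypothetical respondents: (number of tools used, annual salary in USD).
# These rows are made up for illustration; they are NOT survey results.
respondents = [
    (3, 60_000),
    (5, 70_000),
    (8, 85_000),
    (12, 95_000),
    (22, 130_000),
]


def tool_band(tool_count):
    """Map a tool count to a band. The band edges are an assumption."""
    if tool_count <= 5:
        return "1-5"
    if tool_count <= 10:
        return "6-10"
    if tool_count <= 20:
        return "11-20"
    return "more than 20"


def mean_salary_by_band(rows):
    """Group salaries by tool band and return the mean salary for each band."""
    grouped = {}
    for tool_count, salary in rows:
        grouped.setdefault(tool_band(tool_count), []).append(salary)
    return {band: mean(salaries) for band, salaries in grouped.items()}


def pearson(rows):
    """Pearson correlation between tool count and salary."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    for band, avg in mean_salary_by_band(respondents).items():
        print(f"{band} tools: mean salary ${avg:,.0f}")
    print(f"correlation: {pearson(respondents):.2f}")
```

Bear in mind that a strong correlation in a table like this doesn’t tell you that learning more tools causes a higher salary. As noted above, the number of tools people use is tangled up with which cluster they work in.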
