How to make the move from Data Analyst to Data Scientist
Given my background in customer analytics, this post will come from that angle, but similar inferences can be drawn for other industries such as insurance, credit risk and fraud.
Whilst there are many articles detailing “Data Analysts vs Data Scientists”, what I really want to do in this post is to highlight how someone in the Data Analytics field can transition into Data Science.
These two fields are quite difficult to define: although they may have started off quite distinct, they are now merging into one (whether you agree with it or not). A lot of companies use the two terms synonymously, whilst others keep a clear distinction. So for the purposes of this post, I will try to define them in the way the industry is beginning to define them (particularly noticeable from a recruitment angle), with the disclaimer that the definitions may well differ for each individual company/role.
Who are/were the traditional Data Analysts?
*n.b. All these role titles could also be applied to Data Scientists depending on what they do!
Creation of bespoke statistical models
Production of Insights
SAS: Base, Enterprise Guide
SPSS (Base, Clementine aka IBM Modeller)
Distinction between Analysts and Modellers
The difference between these two is important to highlight, as the latter type of analyst has, for the most part, transitioned relatively seamlessly into a Data Scientist role, whilst the former has generally remained in the same camp.
The distinction between Data Analysts and Modellers tends to be made by larger consultancies/in-house teams, where analysts do some modelling but for the most part focus on answering short-term business questions or prioritise BAU. When modelling is done, it is for a very specific purpose, e.g. campaign analysts building a predictive model for a specific campaign. However, depending on a company’s analytics structure, these modelling tasks would sometimes be passed on to modelling teams. Although some statistics are applied in these roles, they tend to be quite basic.
Modellers, by contrast, tend to focus on longer-term projects. A lot of firms fit these analysts into an “Advanced Analytics” team. These teams deal mostly in creating bespoke statistical/mathematical models, and are touted as being more statistically minded and as using the latest techniques and software.
Emergence of Data Scientists
This question of “What are Data Scientists?” has probably been answered to death in multiple articles and blog posts, so I won't try to define these, but instead highlight how the market transitioned and is still transitioning to Data Science.
I started to notice something interesting in the job market around 2010. Jobs started to be advertised for “Data Scientists”. These jobs were asking for Mathematics/Physics PhD grads with experience in advanced predictive modelling, as well as experience with programming software such as R, Python, Mathematica and MATLAB (software we used at university, but never really encountered in the industry), and featured other buzzwords such as “Machine Learning”, “Big Data” and “Hadoop”.
A lot of these terms were quite alien to some but then things began to change. The nice, clever guys who were developing all these models in R/Python were releasing the libraries, with all the developed models, online (the benefit of open source!) and all of a sudden we were all able to use machine learning techniques and the number of Data Scientists began to increase rapidly!
Now a lot of Modelling/Advanced Analytics teams were turning into Data Science teams, and even some Data Analysts were being given the title Data Scientist, so the two terms started to become synonymous in some companies.
However some clear distinctions still separated Data Analysts and Data Scientists:
The tools used are probably the main distinguishing feature for most roles, as traditional analytics tools such as SAS and many SQL variations cannot provide the functionality needed for statistical techniques that require greater computational power. So languages such as R and Python have become the go-to languages for Data Scientists.
The above tools are necessary when working in a cloud environment. Running more complex models requires more computational power, so using cloud services and systems such as TensorFlow is necessary.
Machine Learning Algorithms:
Newer techniques that could not be run without the above tools/environments could now be applied to any problem. These algorithms are more complex than traditional modelling techniques (Logistic Regression, k-means clustering, CHAID decision trees), but they follow similar approaches in application.
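To illustrate the point that newer algorithms follow a similar approach in application, here is a minimal sketch using scikit-learn. The data is synthetic, and the choice of a random forest as the "newer" technique is my own assumption for illustration; the point is only that both models are fitted and scored in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, generated purely for illustration.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One traditional technique and one newer one (the pairing is an assumption).
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# The workflow is identical for both: fit on training data, score on held-out data.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: hold-out accuracy = {accuracy[name]:.3f}")
```

Swapping in a different algorithm only means changing one entry in the dictionary, which is much the same experience as moving between techniques in a traditional modelling tool.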
4 key ways to transition into Data Science
Learn why the algorithms and environment are different and how revolutionary they are. There is a plethora of blogs and videos online to look through.
This is a fast-moving industry, so you need to keep up! It will work for you if Data Science interests you so much that you read up and learn in your spare time.
R and Python are the two main languages you should learn, although recent news suggests Python is where it is at, so it is worth learning that as a priority.
As for what to learn, I would suggest learning how to use these languages within a framework such as Hadoop (specific free courses are available online), and learning to code with the libraries you will use in industry.
The following are good ones to start with:
plyr - data manipulation
ggplot2 - data visualisation
For specific machine learning algorithms in R, there are many different libraries to choose from, and they can be found quite easily.
NumPy/pandas - important libraries to learn when handling data
scikit-learn - contains a lot of the popular machine learning algorithms
Matplotlib - data visualisation library
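As a rough sketch of how the Python libraries above fit together, the example below uses NumPy to generate some data, pandas to summarise it and scikit-learn to fit a simple model. The customer data, the column names and the churn rule are all made up for illustration and do not come from any real project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical customer data, generated at random purely for illustration.
rng = np.random.default_rng(seed=42)
n_customers = 500
customers = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, size=n_customers),
    "monthly_spend": rng.normal(50.0, 15.0, size=n_customers).round(2),
})

# Made-up rule: the shorter the tenure, the more likely the customer churns.
tenure = customers["tenure_months"].to_numpy()
churn_probability = 1.0 / (1.0 + np.exp(0.1 * (tenure - 24)))
customers["churned"] = (rng.random(n_customers) < churn_probability).astype(int)

# pandas: the kind of summary an analyst might otherwise write in SQL or SAS.
customers["tenure_band"] = pd.cut(
    customers["tenure_months"],
    bins=[0, 12, 24, 60],
    labels=["0-12", "13-24", "25+"],
)
churn_by_band = customers.groupby("tenure_band", observed=True)["churned"].mean()
print(churn_by_band)

# scikit-learn: fit a simple model and check it on held-out data.
features = customers[["tenure_months", "monthly_spend"]]
target = customers["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=0
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Hold-out accuracy: {test_accuracy:.2f}")
```

The group-by summary is the sort of output a traditional analyst would already be producing; the last few lines are the only part that is new, which is why these libraries are a gentle place to start.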
Another important thing to learn about is how data is being computed. This is something an analyst never really had to think much about before, but how we run the data is almost as important as which algorithms are applied to it, as the latter depends on the former. E.g. if we want to run a deep learning algorithm, we may need to run it via a library such as TensorFlow.
Practising what you have learnt is probably the most important step in my opinion, and the way to do that is to use competition data from sites such as Kaggle. Thanks to these competitions, everyone now has access to great data sets to test and learn algorithms on. We can also see the winning results for the competitions, so we have all the resources necessary to learn with.
Also, the company where you work could offer opportunities to apply the latest methodologies/tools you have learnt or read about. Being proactive in suggesting or applying these in your team could a) help you learn a lot better, b) help increase your credibility around Data Science in your team, and c) kick-start or progress the company's transition to a more Data Science oriented approach, if it hasn't made one already.
It is also important that you set up a GitHub account. GitHub is an online repository for project work. Keeping all your competition or other work there, and updating it regularly, is an easy way to highlight your ability, and it acts as an online CV (some prospective employers may even ask to see it!).
4. Find the right job:
If you’re looking to make this transition, then there is good news: a lot of companies are making the same transition as you! They have realised that moving their data to a cloud environment is better and cheaper, but the move takes time, so during the transition period they require analysts who know SQL and SAS as well as a bit of R/Python (not asking for too much!). Getting into these roles will make this transition very straightforward for you, as both you and your employer are on the same journey.
A lot of descriptions of Data Scientists state that they are hybrids of software engineers/coders and statisticians. Although this is correct to some degree (and this may be a contentious point), I feel this definition applies more to what Data Scientists originally were. In the current climate, having an overall knowledge of different techniques and how to apply them to different problems is more important (not too different from what was needed for most data analysts). Unless you go in for a job that requires it, it is very unlikely you will have to code up an algorithm from scratch! It is more likely that you will need to find an appropriate model to apply to your problem, find a relevant library that supports it, and derive a solution. That's not to say you shouldn't know what the model is doing. You need to be able to justify why you chose one algorithm over another, and relate that to your results.
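One common way to back up the choice of one algorithm over another is to compare candidates with cross-validation against a naive baseline. The sketch below is only one possible approach: the data is synthetic, and the candidate models and the AUC metric are my own assumptions rather than a recommendation for any particular problem.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real business problem.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=1
)

# Candidate models, all taken from an existing library rather than coded from scratch.
candidates = {
    "baseline (most frequent class)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(random_state=1),
}

# 5-fold cross-validated AUC for each candidate.
mean_auc = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    mean_auc[name] = scores.mean()
    print(f"{name}: mean AUC = {mean_auc[name]:.3f} (std {scores.std():.3f})")

best_model = max(mean_auc, key=mean_auc.get)
print(f"Best candidate on this data: {best_model}")
```

The numbers alone are not the justification. If a simpler model scores almost as well as a complex one, being able to explain it to the business may be reason enough to prefer it.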
Looking at traditional analytics, if you enjoyed modelling, chances are you will enjoy Data Science.
It is also worth considering the following:
Career aspirations - do you want to be a hands-on modeller or do you want to move into more managerial/consultant roles? If the latter, then you may not need to focus on learning and applying algorithms, but more on knowing how the latest tech can be applied to different problems.
Interest in Data Science and modelling - this is key to any career, but if you have little or no interest in this field then you may want to consider whether making this transition is right for you. As mentioned above, this is a fast-moving industry with new libraries/techniques/tools being released regularly, so you need to keep up with all the latest news and techniques, and this is best done in your spare time. So if you don't have an interest in this area, it will be difficult to motivate yourself to keep up to date with everything that's going on!
Suhaib Qazi is a freelance analytical and Data Science consultant with over 10 years' experience in customer analytics.