LinkedIn reports a 56% increase in data scientist job openings in the U.S. over the past year. Glassdoor’s list of Best Jobs in America has had Data Scientist as the top job for the past three years. Indeed has more than 4,000 data scientist openings nationwide. How do you take advantage of this hot career path and become the best data scientist you can be?
Because of the quantity of data available and the improved capacity of computers to process it, your programming skills are now as important as, if not more important than, your mathematical or statistical skills if you want to be the best data scientist. Topping the list of necessary programming skills is Python, a general purpose, object-oriented programming language that runs on most operating systems. Python has powerful data visualization tools and a set of libraries that are invaluable for machine learning. R continues to grow in popularity because it is a free, open source statistical software package that greatly simplifies the analysis of large data sets. SQL is essential for retrieving and manipulating large amounts of data. Hadoop is software that stores and processes large volumes of data across clusters of computing devices. It is flexible and scalable and helps identify trends and predict outcomes to improve decision-making. Tableau and Apache Spark are also often mentioned, as they supplement, complement or replace some of the capabilities of R, Python and Hadoop.
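To make the SQL point a little more concrete, here is a minimal sketch of retrieving and summarizing data with a query, run from Python through its built-in sqlite3 module. The `orders` table, its columns and its rows are invented purely for illustration; a real project would query an existing database instead.

```python
import sqlite3

# Hypothetical in-memory table: the name, columns and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [
        ("east", 120.0),
        ("west", 80.0),
        ("east", 30.0),
        ("west", 45.0),
        ("north", 10.0),
    ],
)

# Retrieve and aggregate: total amount and order count per region,
# sorted so the region with the largest total comes first.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total, COUNT(*) AS n "
    "FROM orders GROUP BY region ORDER BY total DESC"
).fetchall()

for region, total, n in rows:
    print(region, total, n)

conn.close()
```

The same GROUP BY pattern carries over to most SQL databases, although the connection code will differ from one system to another.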
While the heavy lifting on data processing and analysis is done by computers, it is essential that the data scientist understand what is happening. The amount of data available means there are nearly endless avenues of analysis to pursue. Choosing which avenue to follow, and managing that choice, is a critical decision for the best data scientist. Choosing which statistical technique to apply, and understanding the limitations and assumptions of those techniques, is as important as processing the data. Machine learning is the hottest application currently, and everything from logistic regression and neural networks to cutting-edge artificial intelligence like deep learning is part of the toolkit.
Tech and financial services led the way in utilizing data scientists, but now there is demand in almost every industry. Analyzing profit and conversions may be standard across many industries, but key performance indicators (KPIs) can vary dramatically. Goals, requirements and limitations are unique to each industry, and the best data scientist is someone who can find meaningful insights and make useful recommendations for an individual business. You must pose the right questions and know how to make the computer help you answer them.
You cannot be the best data scientist without the ability to report technical findings in a comprehensible and compelling manner to non-technical colleagues, managers or clients. Visualization skills are critical for making your presentation as interesting and informative as possible. There are great programming resources for this, but giving presentation short shrift means you drop the ball just before the finish line of your project.
Practice, Practice, Practice
Analytics Vidhya, a community of data practitioners, has a terrific article, originally published in 2016 and updated in 2018, that gives you data science projects to practice on and potentially showcase your skills. Better yet, it is free and broken down into beginner, intermediate and advanced problems. The data sets are also free, and Analytics Vidhya provides a tutorial to assist you.
Beginner projects have data that is fairly easy to work with and doesn’t require complex statistics. Some of the questions to be answered include: predict the class of the flower based on available attributes, predict if a loan will get approved or not and predict the traffic on a new mode of transport.
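As a rough idea of what the flower problem involves, here is a minimal sketch of a nearest-centroid classifier in plain Python. It is one simple approach among many, not the method the tutorial uses. The measurements and class labels below are invented for illustration and are not taken from the actual data set.

```python
from collections import defaultdict
from math import dist

# Made-up training data: (petal length cm, petal width cm) -> class label.
# Both the numbers and the label names are hypothetical.
TRAIN = [
    ((1.4, 0.2), "short-petal"),
    ((1.5, 0.3), "short-petal"),
    ((4.5, 1.5), "medium-petal"),
    ((4.7, 1.4), "medium-petal"),
    ((5.9, 2.1), "long-petal"),
    ((6.1, 2.3), "long-petal"),
]


def fit_centroids(samples):
    """Average the feature vectors of each class to get one centroid per class."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = fit_centroids(TRAIN)
print(predict(centroids, (1.3, 0.2)))  # short-petal
print(predict(centroids, (5.8, 2.0)))  # long-petal
```

In practice you would load the real data set, hold out part of it to measure accuracy, and probably reach for a library such as scikit-learn rather than writing the classifier by hand.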
Intermediate problems have more challenging data sets that require serious pattern recognition skills. Problems to be solved include: predict the activity category of a human, classify the documents according to their labels and predict the release year of a song.
Advanced projects have high-dimensional data sets and require strong creativity in applying advanced analytics like neural networks, deep learning and recommender systems. Problems include: classify the type of sound from the audio, predict the time taken to solve a problem given the current status of the user, and use deep learning techniques to answer open-ended questions about images.