Smith Hanley has been recruiting statisticians for over thirty-five years. We have evolved along with our clients and candidates, from traditional marketing and business analytics roles to full-fledged data scientist positions. What did that evolution require? A data scientist technology skill set added on top of your core statistical skills.
What technology is the most in-demand? Hadoop, of course. The Hadoop Distributed File System (HDFS) is the foundation for most of today’s big data projects. HDFS allows very large, multi-petabyte datasets to be stored across clusters of inexpensive commodity machines. It is FLEXIBLE: it can handle multiple data formats. It is SCALABLE: it can accommodate small to very large workloads. It is AFFORDABLE: it is open source and allows even small organizations with modest budgets to reap big data benefits.
Then why am I hearing so much about Spark? Spark is growing rapidly in popularity because of the speed with which it processes data. Spark processes data in memory, which makes near-real-time processing possible. Data can be fed into an analytical application the moment it is captured and insights fed back to the user through a dashboard, allowing action to be taken. Spark does not have its own distributed file system, so it typically relies on Hadoop’s HDFS (or another storage system) to hold the data. What Spark does compete with is MapReduce, Hadoop’s processing engine.
MapReduce does not have an interactive mode, but add-ons like Hive and Pig can make working with it easier. MapReduce processes data in batches and writes intermediate results to disk, whereas Spark keeps working data in memory (RAM). Spark can perform batch processing too, but it really excels at streaming workloads, interactive queries and machine learning. Because it keeps data in memory, Spark can be up to 100x faster than MapReduce on some workloads.
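For readers who have not seen the MapReduce model before, here is a minimal sketch of its three steps (map, shuffle, reduce) as a word count in plain Python. It runs on one machine and uses no Hadoop or Spark APIs; the function names and the sample sentences are made up for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each word's list of counts into one total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three steps in order over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored in HDFS.
    sample = ["Spark and Hadoop", "Hadoop stores data", "Spark processes data"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run in parallel on many machines, and MapReduce writes the intermediate results to disk between steps, which is where Spark's in-memory approach saves time.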
SQL, Python, R, SAS
What other software belongs in this discussion of the data scientist technology needed to succeed? As Business News Daily said, “Every company today that gathers data needs somebody who is able to utilize SQL to quickly pull out key data components and generate reports that aid the decision making process.” Python offers a rare combination: it is a capable general-purpose programming language that is also easy to use for analytical computing. And don’t forget the rise of open-source R and the historical strength of SAS for statistical analysis.
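To make the SQL point concrete, here is a small sketch of the kind of "pull out key data and generate a report" query the quote describes, run from Python with the standard-library sqlite3 module. The orders table, its columns and the sample rows are all hypothetical; a real report would query your company's own database.

```python
import sqlite3


def revenue_by_region(rows):
    """Load (region, amount) rows into an in-memory table and summarize them."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema used only for this illustration.
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        # One SQL statement does the grouping, counting, summing and sorting.
        cursor = conn.execute(
            "SELECT region, COUNT(*), SUM(amount) "
            "FROM orders GROUP BY region ORDER BY SUM(amount) DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample_orders = [("East", 120.0), ("West", 80.0), ("East", 30.0)]
    for region, order_count, total in revenue_by_region(sample_orders):
        print(f"{region}: {order_count} orders, {total:.2f} total")
```

The same pattern carries over to larger systems: Hive, for example, lets you write SQL-like queries against data stored in Hadoop.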
And this list will be added to and changed over the next six months. We guarantee it! Want some help keeping pace with the data science career track? Contact the Data Science Executive Recruiters at Smith Hanley Associates: Nancy Darian, Nihar Parikh, Paul Chatlos, Eda Zullo and Rory Hauser.