Michael Li and Matt Maccaux of O’Reilly.com used two terms that perfectly define the struggle firms are having in building a data science function and monetizing that work:
1. Organizational Maturity
2. Scalability Limitations
As recruiters of data scientists, we have many clients interested in doing advanced artificial intelligence or machine learning work before they’ve even organized their data or hired their first data scientist. Expectations of improved efficiency, increased revenue or reduced costs can only be met once the right tools and people are in place.
Data scientists’ number one complaint is “the data is a mess.” This is not just about missing values or inconsistencies between databases. The sheer volume of available data can be too overwhelming and complex to understand at first glance, even for experienced data scientists. Working with data that has already been aggregated into useful variables is far easier than working with a table of every action a user has ever taken on a site.
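To make the contrast concrete, here is a minimal sketch of turning a raw action table into a handful of per-user variables. The event log, the column layout and the `aggregate_by_user` helper are all made up for illustration; a real site would have far more actions and a real data store behind them.

```python
from collections import defaultdict

# Hypothetical raw event log: one row per user action (user_id, action, amount).
events = [
    ("u1", "view", 0.0),
    ("u1", "purchase", 20.0),
    ("u2", "view", 0.0),
    ("u1", "view", 0.0),
    ("u2", "view", 0.0),
    ("u2", "purchase", 5.0),
    ("u2", "purchase", 7.5),
]


def aggregate_by_user(rows):
    """Collapse an event log into one record of summary variables per user."""
    features = defaultdict(lambda: {"views": 0, "purchases": 0, "total_spent": 0.0})
    for user_id, action, amount in rows:
        record = features[user_id]
        if action == "view":
            record["views"] += 1
        elif action == "purchase":
            record["purchases"] += 1
            record["total_spent"] += amount
    return dict(features)


user_features = aggregate_by_user(events)
for user_id, record in sorted(user_features.items()):
    print(user_id, record)
```

Seven rows of raw actions become two rows of variables that a model, or a person, can reason about directly.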
Another complaint, often from senior management, is, “We have a lot of data and we are not doing anything with it.” This can happen when data scientists have to spend all their time organizing and processing the data, trying different models and fine-tuning them, rather than formulating questions that address business problems.
Kalyan Veeramachaneni, Principal Research Scientist at MIT, offered four principles for creating true impact from data science in a Harvard Business Review article:
Simple Models
In their research at MIT, Veeramachaneni and his colleagues found that simple models, such as logistic regression or those based on decision trees or random forests, were sufficient to solve most problems. Keeping models simple reduced the time between data acquisition and the development of the first predictive model.
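As a rough illustration of how little machinery a simple model needs, the sketch below fits a one-variable logistic regression with plain gradient descent. The toy data (weekly visits and whether a customer renewed), the function names and the learning-rate settings are assumptions chosen for this example, not anything taken from the MIT research; in practice a library implementation would normally be used.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit y ~ sigmoid(w * x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors for the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up toy data: visits per week, and whether the customer renewed (1) or not (0).
visits = [0, 1, 2, 3, 4, 5, 6, 7]
renewed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(visits, renewed)


def predict(x):
    """Estimated probability of renewal for a customer with x weekly visits."""
    return sigmoid(w * x + b)


print(f"P(renew | 0 visits) = {predict(0):.2f}")
print(f"P(renew | 7 visits) = {predict(7):.2f}")
```

The whole model is two numbers, `w` and `b`, which makes it quick to build and easy to explain to a business stakeholder.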
Exploring More Problems
“Instead of exploring one business problem with an incredibly sophisticated machine learning model, companies should be exploring dozens, building a simple predictive model for each one and assessing their value proposition.”
Use Sample Data
“Circumventing the use of massive computing resources will enable the exploration of more hypotheses.”
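A small sketch of the idea, assuming a synthetic dataset of 200,000 order values generated just for this example: a 1% random sample gives nearly the same average as the full data at a fraction of the computing cost.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical "full" dataset: 200,000 synthetic order values.
population = [rng.gauss(50.0, 10.0) for _ in range(200_000)]

# Draw a 1% simple random sample without replacement.
sample = rng.sample(population, k=2_000)

full_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"full mean:   {full_mean:.2f}")
print(f"sample mean: {sample_mean:.2f}")
```

Whether a sample is good enough depends on the question being asked; rare events, for example, may need a larger or stratified sample.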
Focus on Automation
“To achieve both reduced time to first model and increased rate of exploration, companies must automate processes that are normally done manually. Over and over across different data problems, we found ourselves applying similar data processing techniques…streamline these, and develop algorithms and software systems that do them automatically.”
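One way to picture this kind of automation is a short list of reusable processing steps applied the same way to every dataset. The sketch below is only an illustration: the step functions, the `run_pipeline` helper and the two tiny datasets are hypothetical, not the systems described in the article.

```python
import statistics


def drop_incomplete(rows):
    """Remove rows that contain a missing (None) value."""
    return [row for row in rows if None not in row]


def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


def run_pipeline(rows, steps):
    """Apply each processing step in order and return the final rows."""
    for step in steps:
        rows = step(rows)
    return rows


# The same preparation steps, defined once and reused for every problem.
PREPARE = [drop_incomplete, standardize]

# Hypothetical datasets for two different business problems.
datasets = {
    "churn": [(1.0, 10.0), (2.0, None), (3.0, 30.0)],
    "upsell": [(5.0, 1.0), (7.0, 3.0), (9.0, 5.0)],
}

prepared = {name: run_pipeline(rows, PREPARE) for name, rows in datasets.items()}
for name, rows in prepared.items():
    print(name, rows)
```

Once the steps live in one place, adding a new business problem means adding a dataset, not rewriting the preparation work.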