Managers and executives at all levels are now expected to be at least familiar with how machine learning models are built and deployed. However, if you don't have a formal data science education, reading industry publications is of limited help: high-level use case descriptions and marketing materials too often present machine learning as a kind of dark magic powering the product, technical publications tend to be incomprehensible to nonspecialists, and how-to guides simply list the steps without explaining why each step is needed, which limits understanding.
To fill this gap, we outline eight key steps for building and deploying machine learning models to production. Naturally, you first need to clearly define your business problem, identify the data sources needed, and determine how you will measure success. Then proceed to the steps below, which walk through the technical model-building process and explain why each step is necessary.
Step 1: Prepare data and create features.
Reason: The data isn’t ready for analytics.
With dozens of customer tables from multiple geographies and databases, supplemented by a multitude of transaction tables, reference tables, and the like, companies face an overwhelming amount of data, all of which needs to be heavily processed to be useful for analytics. For example, raw transaction records might be rolled up into the number of purchases each customer made over the past 12 months in each product category. This process is often called feature engineering, and it can take months for a complex project.
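To make this concrete, here is a minimal sketch in Python, assuming a hypothetical pandas table of raw transactions with customer_id, category, amount, and date columns:

```python
import pandas as pd

# Hypothetical raw transaction data: one row per purchase.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "category":    ["grocery", "apparel", "grocery", "grocery", "electronics"],
    "amount":      [25.0, 80.0, 40.0, 15.0, 300.0],
    "date":        pd.to_datetime(["2023-01-10", "2023-03-05", "2023-02-20",
                                   "2023-06-01", "2023-07-15"]),
})

# Keep only the past 12 months of activity.
cutoff = transactions["date"].max() - pd.DateOffset(months=12)
recent = transactions[transactions["date"] >= cutoff]

# One feature column per product category: number of purchases per customer.
features = (recent
            .groupby(["customer_id", "category"])
            .size()
            .unstack(fill_value=0)
            .add_prefix("purchases_12m_"))
print(features)
```

In a real project this kind of aggregation is repeated across hundreds of source tables and time windows, which is why the step can take so long.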
Step 2: Select features.
Reason: Some of the data is irrelevant and/or redundant.
Coming out of the previous step, you may have a nicely organized table with thousands of feature columns describing customer behavior. But many of those variables are irrelevant or redundant, and feeding them all into a model only adds noise and drastically slows training. The key is to prune the features, which usually requires a combination of business judgment and mathematical techniques, such as correlating inputs to the target or applying mRMR (minimum redundancy maximum relevance).
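As one illustration (not the only approach), a simple univariate filter in scikit-learn can rank features by their relationship to the target; mRMR-style methods go further by also penalizing redundancy among the selected features. The data below is synthetic:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for a wide customer feature table.
X, y = make_classification(n_samples=1000, n_features=200,
                           n_informative=15, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])

# Keep the 20 features most related to the target (a simple univariate filter;
# mRMR additionally penalizes redundancy between the selected features).
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)
kept = X.columns[selector.get_support()]
print(len(kept), "features kept, e.g.:", list(kept)[:5])
```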
Step 3: Reduce dimensions.
Reason: Too much of the data is similar.
Often, many input variables are highly correlated with one another. In retail, for example, a store's revenue for a particular day is highly correlated with its number of transactions for that day, so keeping both variables may not be necessary. Better still, some techniques can convert many partially correlated variables into a smaller set of nearly independent ones; common choices are principal component analysis (PCA) and deep autoencoders.
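A rough sketch of PCA-based reduction with scikit-learn, using synthetic data in place of real store metrics, might look like this:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Correlated inputs: daily revenue is largely driven by transaction count.
rng = np.random.default_rng(0)
transactions = rng.poisson(200, size=(500, 1)).astype(float)
revenue = transactions * 35 + rng.normal(0, 50, size=(500, 1))
other = rng.normal(size=(500, 8))
X = np.hstack([transactions, revenue, other])

# Standardize, then keep enough components to explain 95% of the variance.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)
print(X.shape[1], "original columns ->", X_reduced.shape[1], "components")
```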
Step 4: Choose the appropriate model.
Reason: Different data and target variable types require different models.
Now that the data is prepared, the data scientist needs to pick a model suited to the data set and the target variable. Predicting a continuous variable (e.g., next year's revenue), predicting a categorical outcome (e.g., yes/no), and finding natural clusters in the data (e.g., customer segments) each call for different types of models. Different models may also work better with different data sets (e.g., clean versus sparse data), so some degree of experimentation may be needed.
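For illustration, the choice might look something like this in scikit-learn (the specific models shown are common defaults, not recommendations):

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Continuous target (e.g., next year's revenue) -> regression.
regressor = LinearRegression()

# Categorical target (e.g., churn: yes/no) -> classification.
classifier = LogisticRegression(max_iter=1000)

# No target, looking for natural groupings (e.g., customer segments) -> clustering.
segmenter = KMeans(n_clusters=5, random_state=0)
```

In practice a data scientist would try several model families within the chosen category and compare their performance.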
Step 5: Tune model parameters.
Reason: Choosing the right parameters for a given problem can significantly improve prediction accuracy.
Some models, such as logistic regression, have few if any tunable parameters, while others, like the popular XGBoost (an ensemble of decision trees), have ten or more. Picking the right parameters often makes the difference between mediocre and excellent accuracy. And the only way to know whether the parameters are right is to test them, which requires a lot of trial and error, as well as the ability to run multiple experiments and quickly review performance results.
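One common way to automate this trial and error is a cross-validated search over candidate parameter values; the sketch below assumes XGBoost's scikit-learn wrapper is installed and uses synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

# A few of XGBoost's many tunable parameters and candidate values.
param_distributions = {
    "max_depth":     [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators":  [100, 300, 500],
    "subsample":     [0.7, 1.0],
}

# Cross-validated random search runs many experiments automatically
# and reports the best-performing combination.
search = RandomizedSearchCV(XGBClassifier(), param_distributions,
                            n_iter=10, scoring="roc_auc", cv=3, random_state=0)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated AUC: %.3f" % search.best_score_)
```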
Step 6: Train the model on the full data set.
Reason: The more data the model is trained on, the more accurate its results will generally be.
Training is often done on a subset of the data, because some data is set aside for testing, and that training subset is often subsampled further to speed up parameter tuning. Once tuning is complete and you have the best model with the best parameters, it's time to train the model on the full data set to get the highest possible accuracy.
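Continuing the illustration, the final retraining step is short: take the best parameters found during tuning (hypothetical values below) and fit a fresh model on the complete training set:

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Stand-in for the full training data; tuning may have used only a subsample.
X_full, y_full = make_classification(n_samples=20000, n_features=40, random_state=0)

# Best parameters found during tuning (hypothetical values for illustration).
best_params = {"max_depth": 5, "learning_rate": 0.1,
               "n_estimators": 300, "subsample": 0.7}

# Retrain on the complete data set to get the final, most accurate model.
final_model = XGBClassifier(**best_params)
final_model.fit(X_full, y_full)
```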
Step 7: Streamline the move from development to production.
Reason: Not all software and platforms do this the same way, and for many, the process is time consuming and cumbersome.
Once the model is prepared, it needs to be deployed to do “work,” which usually means running it against new data points (a process called scoring). For example, you might have trained an anomaly detection model and now want to score each incoming credit card transaction to detect potential fraud. Depending on the platform and software package you use, this process may be cumbersome. Some users are forced to recode the model into Java or Python to make it run on a parallel Hadoop infrastructure, which is highly inefficient. A smooth, recoding-free, development-to-production process is essential.
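Absent a platform that handles this for you, a bare-bones Python pattern is to save the trained model as an artifact and load that same artifact for scoring, with no recoding (the model and file name below are illustrative):

```python
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# Development: train an anomaly detection model (stand-in for a fraud model) and save it.
X_train, _ = make_classification(n_samples=5000, n_features=10, random_state=0)
model = IsolationForest(random_state=0).fit(X_train)
joblib.dump(model, "fraud_model.joblib")

# Production: load the same artifact and score each incoming transaction.
scorer = joblib.load("fraud_model.joblib")
new_transaction = np.random.default_rng(1).normal(size=(1, 10))
score = scorer.decision_function(new_transaction)   # lower = more anomalous
print("Anomaly score:", score[0])
```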
Step 8: Monitor ongoing operations.
Reason: It's important to determine whether the model is still producing accurate, relevant results.
Once the model is running, you need to keep a watchful eye on it to make sure it is still performing well. Check whether results are drifting, whether the distribution of incoming data is changing significantly, and whether the model needs to be retrained. Automatic monitoring and preprogrammed alerts and actions can help.
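A simple example of such a check is a two-sample Kolmogorov-Smirnov test comparing the training-time distribution of a feature or score with recent production data (synthetic numbers for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_scores = rng.normal(0.0, 1.0, size=5000)   # distribution at training time
incoming_scores = rng.normal(0.4, 1.0, size=1000)   # recent production data (shifted)

# Has the incoming data distribution changed significantly?
stat, p_value = ks_2samp(training_scores, incoming_scores)
if p_value < 0.01:
    print(f"Drift detected (KS statistic = {stat:.3f}); consider retraining the model.")
else:
    print("No significant drift detected.")
```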
Conclusion
Overall, if you are a technology leader considering a software platform, choose one that will guide you through and automate the steps above. And if you are a business leader, refer to this guide to ensure the key steps of the process have been taken.
Want to learn more about how to implement a streamlined Big Data solution for your company? Download our paper, “Delivering Big Data Success with the Signal Hub Platform.”
Anatoli Olkhovets is vice president of product management at Opera Solutions.