
By Rohit Kumar, PhD from the Big Data & Data Science Unit at Eurecat
Deploying machine learning models to production, and maintaining the continuous integration and continuous deployment (CI/CD) needed to keep those models up to date, is considerably more tedious than updating traditional software or applications in production.
As with software development life cycle management, there are now well-developed methodologies and tools for managing the machine learning life cycle. In this document we cover some of the most common ML model deployment scenarios and go through the pros and cons of each one, which can help you decide when to use which approach. We also present some success stories where we implemented various ML models in production.
Model Training
The first step in every ML project is model training. This involves setting up the environment so that the data scientist has access to the required data and can test models against real-world data. There are two approaches to model training: static models (trained manually and then deployed) and dynamic models (models that are retrained in production as new data arrives).
Static Models
Static models are trained once in a controlled environment directly by a data scientist, and the final model is deployed to production for use. Examples include a deep learning model that identifies human faces or a speech-to-text conversion model. Such models can be updated in the future, but this always involves a data scientist retraining them in a development environment and deploying the new model if it is an improvement.
Considerations during training: Typically, a data scientist uses a platform like Jupyter to train models and run experiments before releasing a model for production. The main considerations during this training step are code versioning and environment conflicts with other projects and teams.
Jupyter notebooks differ from normal Python or application code: internally they are JSON structures that are hard to read, so a repository diff between the current and previous version makes it very difficult to understand what changed. Some considerations for versioning code in Jupyter notebooks are the following:
- Clear output before commit: Always clear the output before committing the notebook to the git repository. This removes any binary blobs that have been generated by the notebook.
- Convert to HTML: It is a best practice to convert the Jupyter notebook, with its results and any graphs, into an HTML file and commit it alongside the notebook itself. The HTML file should be tagged to the final version of the released model, for example model_v1.html, model_v2.html for each version of the model released to production.
- Convert to Python: Apart from the notebook itself, it is good practice to store the Python code exported from the notebook, which makes it easier to diff the code and see what changed between versions (a small export sketch follows this list).
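As an illustration, here is a minimal sketch of this workflow using the nbconvert Python API (the same steps can also be run from the jupyter nbconvert command line); the file names model.ipynb and model_v1.html are only examples:

    import nbformat
    from nbconvert import HTMLExporter, PythonExporter
    from nbconvert.preprocessors import ClearOutputPreprocessor

    nb = nbformat.read("model.ipynb", as_version=4)

    # Export an HTML snapshot with the results and graphs, tagged to the released model version
    html_body, _ = HTMLExporter().from_notebook_node(nb)
    with open("model_v1.html", "w") as f:
        f.write(html_body)

    # Export plain Python source so code changes are easy to diff
    py_body, _ = PythonExporter().from_notebook_node(nb)
    with open("model_v1.py", "w") as f:
        f.write(py_body)

    # Clear all cell outputs before committing the .ipynb itself to git
    ClearOutputPreprocessor().preprocess(nb, {})
    nbformat.write(nb, "model.ipynb")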
Dynamic Models
In some cases, models need to be retrained directly in the production environment using newly arriving data. For example, a churn prediction model needs to be trained regularly on new customer behaviour data to keep its predictions up to date. There are two types of setup for dynamic models:
- Batch models: Models are trained at regular intervals using the new and old data in batch mode; typical frequencies are daily, weekly, or monthly. Multiple approaches exist for continuous batch training, such as using an orchestrator like Airflow to run data pipelines that prepare the data, run training at specified intervals, and update the model in production. Cloud platforms also provide proprietary tools for managing continuous training in production; for instance, Azure Data Factory can be used on the Azure cloud to set up continuous model training. AutoML tools also help a lot in such situations, where retraining can go beyond simply refitting model weights and be more comprehensive.
- Real time models: Online machine learning algorithms can be updated using only the new data, without a full retrain. For example, K-means (through mini-batch updates), linear and logistic regression (through stochastic gradient descent), and the Naive Bayes classifier all support real time model training (see the sketch after this list).
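A minimal sketch of such incremental updates using scikit-learn's partial_fit interface, which algorithms like SGDClassifier and MiniBatchKMeans expose; the random arrays here only stand in for real feature batches:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Initial model, fitted on whatever historical data is available
    # (logistic regression trained via SGD; older scikit-learn versions use loss="log")
    model = SGDClassifier(loss="log_loss")
    X_hist = np.random.rand(1000, 5)
    y_hist = np.random.randint(0, 2, 1000)
    model.partial_fit(X_hist, y_hist, classes=np.array([0, 1]))

    # Later, in production: update the same model with only the new batch of data
    X_new = np.random.rand(50, 5)
    y_new = np.random.randint(0, 2, 50)
    model.partial_fit(X_new, y_new)   # no full retrain needed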
Model Saving
The next main consideration after model training is how the model is packaged and delivered to the production environment. There are multiple ways to do this (a short saving sketch follows this list):
- Pickle: This is the most commonly used approach. It uses Python's serialization mechanism to convert the trained model into a byte stream that can be stored to disk and reloaded later. It is a good approach if the consuming applications are also built in Python and the development and production environments are consistent.
- ONNX: The Open Neural Network Exchange format is an open format that supports storing and porting predictive models across libraries and languages. Most deep learning libraries support it, and scikit-learn has a library extension to convert its models to the ONNX format.
- PMML: The Predictive Model Markup Language is another interchange format for predictive models. As with ONNX, scikit-learn has a library extension for converting models to PMML. It has the drawback, however, of only supporting certain types of prediction models.
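As an illustration, a minimal sketch of saving a scikit-learn model with pickle and converting it to ONNX via the skl2onnx extension; the iris model is only a stand-in, and skl2onnx must be installed separately:

    import pickle
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Pickle: simple, but ties you to Python and a consistent environment
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    # ONNX: portable across languages and runtimes (requires the skl2onnx package)
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType
    onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
    with open("model.onnx", "wb") as f:
        f.write(onnx_model.SerializeToString())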
Model Deployment
There are two ways in which ML models are deployed to a production environment: batch prediction or real time prediction. The choice of one approach over the other depends on the specific business use case. Although real time prediction is often preferred, it is important to weigh its benefits against the complexity and cost implications that come with it.
Batch VS Real Time
- Load implications: Batch prediction systems are easy to manage from a load perspective, as there is no unplanned or unknown peak demand and the computation can be spread over an acceptable time window. In a real time setup, by contrast, there can be sudden spikes in compute demand. For example, if a churn prediction model is deployed in real time and many people call the service at once, the prediction pipeline runs many times at the same instant. In a batch setup there are no surprise loads, so capacity is easy to plan and manage.
- Cost implications: To support the load of a real time prediction system and guarantee its SLAs, sophisticated infrastructure is required. This results in a higher cost of maintaining and managing the infrastructure.
- Model monitoring implications: Monitoring the performance of a real time model and evaluating its results is much more complicated than for a batch system. Real time systems also require a log collection mechanism that captures both the predictions and the features that produced each score, for further evaluation.
Batch model deployment
Deploying a batch prediction model typically consists of two parts (a minimal sketch of the prediction step follows Figure 1):
- Data creation: This step involves creating the required feature sets for the model to make predictions. Different ETL tools can be used to pull the required data from different sources and prepare a final dataset with all the features needed for prediction.
- Running prediction: This step involves iterating over the input dataset created in step 1 and applying the prediction model. The results can be stored in the same dataset or published to a queue-based system to trigger whatever downstream actions are needed based on the outcome of the prediction.
Figure 1 shows a typical data pipeline for batch model deployment.

Fig 1: Batch model deployment pipeline
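A minimal sketch of the prediction step, assuming the feature set was already prepared by the ETL step as a CSV file and the model was saved with pickle; the file and column names are illustrative:

    import pickle
    import pandas as pd

    # Load the trained model and the prepared feature set
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)
    features = pd.read_csv("daily_features.csv")

    # Score the whole batch and store the results alongside the inputs
    feature_cols = ["f1", "f2", "f3"]   # illustrative feature column names
    features["prediction"] = model.predict(features[feature_cols])
    features.to_csv("daily_predictions.csv", index=False)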
Real time prediction model deployment
Real time prediction models require three components to work: a pre-trained model that makes the prediction, an event that triggers the prediction request, and the input features used for the prediction. There are four different ways in which real time prediction models are deployed:
- Database triggers: the model is integrated directly with the database and is triggered by specific database events such as an insert or update.
- Web service: this is the most commonly used scenario, where the model is deployed behind a REST API service.
- Queue based model deployment: this is not strictly real time but near real time; the event is registered in a queue and, depending on the load, it triggers the model.
- In-app: it is also possible to deploy the model directly into a native or web application and have the model run on local or external data sources.
In the next sections we discuss these approaches in detail, with their pros and cons.
1. Database Trigger based deployment
Nowadays most database systems come with integrated plugins for analytics and machine learning. For example, Postgres has an integration called PL/Python that allows Python code to be run as functions or stored procedures. This implementation has access to all the libraries on the PYTHONPATH, and as such can use libraries such as Pandas and scikit-learn to run operations (a small sketch follows the pros and cons below).
Pros: These models are easy to set up, as no extra tools or environments are needed and the existing data source system can be used.
Cons: This approach is only advisable when the overall size of the database is small (fewer than about one million records) and the triggers are neither very frequent nor very numerous. Otherwise it can put too much load on the database system and cause operational issues.
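As an illustration, a minimal sketch of registering such a PL/Python scoring function from Python using psycopg2; it assumes a Postgres instance with the plpython3u extension enabled, and the function name, feature columns, model path, and connection string are all hypothetical. In practice the function would then be attached to an insert or update trigger on the relevant table.

    import psycopg2

    # SQL that defines a PL/Python scoring function; the Python body runs inside Postgres
    create_fn = """
    CREATE OR REPLACE FUNCTION score_customer(f1 float8, f2 float8)
    RETURNS float8 AS $$
        import pickle
        with open('/models/model.pkl', 'rb') as fh:   # hypothetical model path on the DB host
            model = pickle.load(fh)
        return float(model.predict([[f1, f2]])[0])
    $$ LANGUAGE plpython3u;
    """

    conn = psycopg2.connect("dbname=mydb user=postgres")   # hypothetical connection string
    with conn, conn.cursor() as cur:
        cur.execute(create_fn)
    conn.close()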
2. Web Service deployment
This is one of the most common approaches to deploying real time models. The model is packaged as an API endpoint that can be called by anyone needing a prediction, providing either the complete feature set or a unique id that the service uses to fetch the required data, build the feature set, and make the prediction.
Pros: Scales very well to any workload.
Cons: High engineering complexity, and a chance of service failure or unavailability under high load.
There are two main ways in which these API services can be deployed:
2.1. Serverless deployment using Functions
This is available from cloud providers like AWS, Google, Azure, etc., where the model is encapsulated as a function that is triggered on demand. In Figure 2 we show different combinations of cloud tools that can be used to achieve this deployment strategy.

Fig 2: Function based real time model deployment in Cloud.
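As an illustration, a minimal sketch of what such a function could look like as an AWS Lambda handler, assuming the pickled model is bundled with the deployment package and the request body carries the features as JSON; the event shape and file name are illustrative:

    import json
    import pickle

    # Load the model once per container, outside the handler, so warm invocations reuse it
    with open("model.pkl", "rb") as f:
        MODEL = pickle.load(f)

    def handler(event, context):
        features = json.loads(event["body"])["features"]   # e.g. [[0.3, 1.2, 0.7]]
        prediction = MODEL.predict(features).tolist()
        return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}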
2.2. Server based deployment using REST API and docker images
In this approach the model is exposed as a REST API using a framework such as Flask or FastAPI in Python, and the deployment then works like any other API-based service. Dockerizing the service makes deployment and scaling much easier, especially given the complexity of each model's dependency requirements. The Docker images can be deployed on Kubernetes or any similar container orchestration system to scale the number of model instances up or down as demand changes.
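A minimal sketch of such a service using FastAPI, assuming the model was saved with pickle; the endpoint name and payload shape are illustrative:

    import pickle
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    # Load the pickled model once at startup
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    class PredictRequest(BaseModel):
        features: list[list[float]]

    @app.post("/predict")
    def predict(req: PredictRequest):
        return {"prediction": model.predict(req.features).tolist()}

The service would typically be run with uvicorn and packaged together with the model file in the Docker image.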
3. Queue based model deployment
One issue with web service based deployment is the unpredictable nature of the load. If there is suddenly too much demand on the REST API, the service might crash or many requests may time out, giving users a bad experience. To handle such situations, if the "real" real time requirement can be relaxed, queue based model deployment is very useful.
The only thing to be careful about in these setups is to have some kind of monitoring that checks how long it takes to process events in the queue, so that compute resources can be scaled up to stay within the agreed SLAs.
In such a setup, a queue is created where all triggers for prediction are first captured. The queue is continuously polled for new events, and as soon as an event arrives it is processed. If multiple events arrive together, they are processed on a first come, first served basis, and multiple consumers can be set up to process events in parallel.
In Figure 3 we show different combinations of tools that can be used to achieve this deployment strategy, such as Kafka with Python consumers, Google Pub/Sub with Cloud Dataflow, or Azure Event Hubs with Azure Functions.

Fig 3: Queue based deployment options
The typical open-source combination supporting this kind of use case in the data ecosystem is Kafka with Spark Streaming, but different setups are possible in the cloud. On Google Cloud, Pub/Sub with Dataflow (Beam) provides a good alternative to that combination; on Azure, Azure Service Bus or Event Hubs combined with Azure Functions is a good way to consume the messages and generate the predictions.
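A minimal sketch of such a consumer using the kafka-python client, assuming a pickled model and a topic carrying JSON-encoded feature payloads; the topic name, broker address, and payload shape are illustrative:

    import json
    import pickle
    from kafka import KafkaConsumer

    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    # Each message on the topic is expected to carry the features for one prediction request
    consumer = KafkaConsumer(
        "prediction-requests",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        features = message.value["features"]            # e.g. [0.3, 1.2, 0.7]
        prediction = model.predict([features])[0]
        print(f"prediction for event at offset {message.offset}: {prediction}")
        # In practice the result would be written to a store or published to another topic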
Success stories
Redemption model deployment scenario
Problem Statement: Deploy a model to predict the redemption probability of a voucher for a customer. The model needed to be retrained for a specific customer on demand.
Training Deployment: The model was deployed with specific APIs to trigger retraining. The API was integrated into a dashboard that an admin can use to monitor new data statistics and trigger a retraining when needed.

Fig 4: Retraining strategy
Prediction Deployment: The prediction service was deployed using the web service approach with a REST API. The model was stored in Google Cloud Storage along with its metadata.

Fig 5: Model deployment for a redemption scenario
Lead probability model deployment
Problem: Train and deploy a model to predict the lead probability of a customer during a chat interaction.
Training Deployment: Data was continuously updated in an S3 data lake, which was used on a regular basis to run model training and save the newly trained model to another S3 bucket. The job was orchestrated through the AWS Batch platform.
Prediction Deployment: Web service based deployment using Docker images deployed on ECS for high scalability. Models are updated automatically by ECS when the model version changes in the S3 bucket. The prediction service fetches feature data from DynamoDB based on a unique customer id to make the lead probability prediction.

Fig 6: Model deployment for a lead probability scenario