Model Deployment & Serving

This document describes how trained models are automatically deployed and served in production. Training runs are scheduled bi-weekly. The entire process is described in three sections:

  1. Model deployment
  2. Service deployment
  3. Model serving

Model Deployment

The deployment process is as follows:

  1. Train writes a new model to the jobs bucket, e.g., gs://deepdesk-nl-production-jobs-anwb/anwb-verzekeren-chat/20220915-text-pipeline-d2kff/classifier/MLPTF_300_TfIdf_Splt_Agent/1.
  2. Deploy Models copies the directory to the servables bucket, e.g., gs://deepdesk-nl-production-servables-anwb/models/anwb-verzekeren-chat-response-recommender/<new-version>, and creates a new MLModel in Admin.
  3. Pending Models periodically checks the new model; once its redactions are complete, it sets the model with <new-version> as Profile.active_text_model and moves the current active model to Profile.active_text_model_b (see the sketch after this list).
  4. Pending Models then triggers a deployment.
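
Steps 2 and 3 can be summarized in code. The following is a minimal Python sketch, not the actual implementation: it assumes the google-cloud-storage client, and Profile here is a stand-in for the real Admin model.

    # Sketch only: Profile stands in for the real Admin model; bucket and
    # prefix names follow the examples above.
    from google.cloud import storage

    def copy_model_dir(src_bucket: str, src_prefix: str,
                       dst_bucket: str, dst_prefix: str) -> None:
        """Deploy Models (step 2): copy every object under src_prefix to dst_prefix."""
        client = storage.Client()
        src = client.bucket(src_bucket)
        dst = client.bucket(dst_bucket)
        for blob in client.list_blobs(src_bucket, prefix=src_prefix):
            src.copy_blob(blob, dst, blob.name.replace(src_prefix, dst_prefix, 1))

    def promote_model(profile, new_model) -> None:
        """Pending Models (step 3): activate the redacted model."""
        profile.active_text_model_b = profile.active_text_model  # keep previous model as fallback
        profile.active_text_model = new_model                    # activate <new-version>
        profile.save()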

The following diagram illustrates the model deployment:

[Diagram: Model Deployment 1/2]

Service Deployment

The service deployment is as follows:

  • Admin exports the config and commits it to the GitHub deepdesk-config repo (sketched after this list).
  • FluxCD, running in GKE, is triggered by the GitHub commit and creates Kubernetes ConfigMaps, which in turn trigger a Helm install of the applications that use these new ConfigMaps.
  • A new deployment of Engine is created, with the new config.
  • A new deployment of Tensorflow is created, with the same config in a different format.
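
The export-and-commit step can be sketched as follows. This is illustrative only: the local checkout path and the file layout inside deepdesk-config are assumptions, and the real export is performed by Admin.

    # Sketch only: the REPO path and profiles/ layout are assumptions; the
    # push to deepdesk-config is what triggers FluxCD.
    import json
    import subprocess
    from pathlib import Path

    REPO = Path("/srv/deepdesk-config")  # assumed local checkout of the config repo

    def export_and_commit(profile_name: str, config: dict) -> None:
        """Write the exported config and push it, triggering FluxCD."""
        target = REPO / "profiles" / f"{profile_name}.json"  # hypothetical layout
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(config, indent=2))
        subprocess.run(["git", "-C", str(REPO), "add", str(target)], check=True)
        subprocess.run(["git", "-C", str(REPO), "commit", "-m",
                        f"Update config for {profile_name}"], check=True)
        subprocess.run(["git", "-C", str(REPO), "push"], check=True)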

The following diagram illustrates the service deployment. See the FluxCD Deployment Documentation for more detail.

[Diagram: Model Deployment 2/2]

Model Serving

Recommendations are served as follows:

  1. Frontend requests text recommendations from the backend endpoint /api/conversations/<uuid>/recommend/free.
  2. Backend determines the profile from the conversation and forwards the request to the corresponding Engine, at endpoint /api/conversations/<uuid>/recommend/free.
  3. Engine loads the conversation text, converts it to vectors using the locally loaded vectorizer, and forwards the request to Tensorflow, using the model name and version label as the endpoint, e.g., /ziggo-response-recommender/78 (see the sketch after this list).
  4. Engine takes the top 5 predicted classes, localizes the redacted texts, and returns them to Backend.
  5. Backend serves the list of recommended texts.
  6. Frontend displays the recommendations.
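
Steps 3 and 4 can be sketched as follows. This is illustrative only: the vectorizer interface, the Tensorflow host and URL scheme, and the shape of the response are assumptions based on the steps above.

    # Sketch only: host name, response format, and vectorizer API are
    # assumptions; the endpoint follows the model name/version example above.
    import requests

    def recommend(conversation_text: str, vectorizer, model_name: str,
                  version: str, top_k: int = 5) -> list[int]:
        """Vectorize the conversation and return the top predicted classes."""
        vector = vectorizer.transform([conversation_text]).toarray().tolist()
        # Endpoint built from model name and version label,
        # e.g. /ziggo-response-recommender/78.
        url = f"http://tensorflow/{model_name}/{version}"
        response = requests.post(url, json={"instances": vector}, timeout=5)
        response.raise_for_status()
        scores = response.json()["predictions"][0]
        # Top 5 predicted classes, highest score first (step 4).
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]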