Lessons from Deploying a Machine Learning Model with Kubernetes to Google Cloud Platform
Deployments can be hard, especially when working with resource-intensive processes like machine learning models and ensuring they run optimally. After quite a few challenges, the following illustrates a handful of actions taken to deploy a containerized model to Google Cloud Platform (GCP) in a relatively framework-agnostic manner using Kubernetes.
A Quick Brief
In the case outlined, the model and associated API are written in Python with the Theano library and served using the Flask web application framework. The model itself requires a few large asset files to be loaded into memory for it to encode and process the input data.
Delegating most of the steps necessary for deploying a standard container with Google Kubernetes Engine (GKE) to the official documentation, this article focuses on the techniques used to overcome the major pain points throughout the deployment process. These techniques are organized below by area of concern rather than as sequential steps.
Preparing the Application
Before deploying the container to GCP, two methods or routes must be added to the Flask application to ensure the model is loaded on startup, i.e. at the time the pod is provisioned and all Kubernetes health checks pass.
The first method, `_load_model()`, is triggered prior to the execution of the first request made to the API. In this example, the Kubernetes Ingress, discussed later, will perform its readiness-probe health check against the second route, `/healthz`.

```python
...

@app.before_first_request
def _load_model():
    global MODEL
    # Note: In GKE, the health check triggers loading of the model
    MODEL = load_model(_logger=app.logger)


@app.route('/healthz', methods=['GET'])
def health():
    """Health check probe route for the Kubernetes Ingress"""
    return jsonify(code='200', message='Ok')

...
```
By storing the model returned from the `load_model` method in a global variable, all subsequent requests to this pod will utilize it preloaded from memory, significantly reducing request times to the API.
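Stripped of Flask, the load-once pattern reduces to a small framework-independent sketch. Everything below is illustrative: the dummy model contents and timings stand in for the real (much larger) assets.

```python
import time

MODEL = None  # module-level cache, analogous to the Flask global


def load_model():
    """Stand-in for the expensive asset load (seconds in the real app)."""
    time.sleep(0.05)
    return {"weights": [1, 2, 3]}


def predict(x):
    """Load the model on first use; every later call reuses it from memory."""
    global MODEL
    if MODEL is None:
        MODEL = load_model()
    return len(MODEL["weights"]) * x


start = time.perf_counter()
predict(1)                      # first call pays the load cost
first = time.perf_counter() - start

start = time.perf_counter()
predict(1)                      # subsequent calls are near-instant
later = time.perf_counter() - start

print(first > later)  # True
```

The `before_first_request` hook in the application does the same thing, except the "first call" is the readiness probe rather than a user request.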
Storing the Model Assets
Since the `load_model` method reads in a few sizeable Pickle (`.pkl`) files (approx. 2.5 GB) used for encoding the input data, granting the application file system access to these assets is required.
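As a rough illustration of what such a loader might look like, the sketch below writes two small placeholder `.pkl` files to a temporary directory standing in for the mounted models path. The `load_assets` helper, the file names, and their contents are hypothetical, not taken from the original application.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the article's MODELS_PATH mount point
MODELS_PATH = tempfile.mkdtemp()

# Simulate previously transferred asset files (the real ones total ~2.5 GB)
for name, data in [("encoder.pkl", {"vocab": ["a", "b"]}),
                   ("weights.pkl", [0.1, 0.2])]:
    with open(os.path.join(MODELS_PATH, name), "wb") as f:
        pickle.dump(data, f)


def load_assets(path):
    """Read every .pkl file under `path` into a dict keyed by file stem."""
    assets = {}
    for name in sorted(os.listdir(path)):
        if name.endswith(".pkl"):
            with open(os.path.join(path, name), "rb") as f:
                assets[name.rsplit(".", 1)[0]] = pickle.load(f)
    return assets


assets = load_assets(MODELS_PATH)
print(sorted(assets))  # ['encoder', 'weights']
```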
Earlier storage setups included the following schemes, yet neither provided a successful solution.
Attempting to mount a Google Cloud Storage (GCS) bucket containing the files from within the container using `gcsfuse` resulted in a connectivity nightmare.
Downloading the same model assets on application start using the Python client library for Cloud Storage was sluggish, provided limited reliability, and had to be repeated with each restart or provisioning of a new pod.
Persistent Volumes, kubectl exec, and gsutil
After a bit of trial and error, the remedying solution involved mounting a persistent volume in Kubernetes, using `kubectl exec` to run `gsutil` within the container, and manually conducting a one-time transfer of the model files from the bucket to the volume.
Provisioning a persistent volume requires creating a `PersistentVolumeClaim` as a prerequisite to creating the pod that will ultimately utilize it. Employing `kubectl` to deploy new resources, executing the command below against the following `pvc.yaml` configuration file will accomplish the desired result.
kubectl create -f pvc.yaml
```yaml
# pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-disk
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
```
Creating a pod for the container application within the Kubernetes cluster follows the same provisioning step but against a separate `api-deployment.yaml` configuration file.
Some configuration setups may benefit from using a single deployment file for all objects. In the instance discussed, they are split into separate files for clarity.
Generated initially from a `docker-compose.yaml` file using `kompose`, the further modified and slightly reduced example declares:

- Labels and selectors to point the `NodePort` service, to be defined later, to the new pods
- Environment variables representing:
  - the bucket name containing the model files
  - the path to which the model files will be copied and later read from
- A persistent volume claim and the `mountPath` at which to mount said volume
- The resources to be allocated to each pod
- A route for the readiness probe on which GKE will perform the health checks
…along with additional, more standard deployment configuration settings.
kubectl create -f api-deployment.yaml
```yaml
# api-deployment.yaml
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: api-deployment
spec:
  selector:
    matchLabels:
      io.kompose.service: api
  replicas: 1
  template:
    metadata:
      labels:
        io.kompose.service: api
    spec:
      containers:
        - name: my-api
          image: gcr.io/my-project/my-image
          ports:
            - containerPort: 5900
          env:
            - name: GCS_BUCKET
              value: my-models
            - name: MODELS_PATH
              value: /mnt/disk/models/
          volumeMounts:
            - mountPath: "/mnt/disk"
              name: my-disk
          resources:
            requests:
              memory: "8096Mi"
              cpu: "500m"
            limits:
              memory: "16384Mi"
              cpu: "1000m"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 5900
      volumes:
        - name: my-disk
          persistentVolumeClaim:
            claimName: my-disk
      restartPolicy: Always
```
CUDA-Enabled Container Image
As an additional note, the `Dockerfile` used for building `my-image` is a derivative of the `nvidia/cuda:9.0-cudnn7-runtime` container image with Python and the associated package requirements installed.
Transferring the Model Files
With the initial pod and volume created, transferring the models, currently stored locally, first to the GCS bucket and then to the persistent volume can be done with the following sets of commands.
Local Machine to Bucket
Recursively copy the large files in a multithreaded manner (`-m`) to ensure the most efficient transfer.
gsutil -m cp -r models gs://my-models
Get Pod Name
Retrieve the name of the created pod in which to execute the transfer.
kubectl get pods
Bucket Through Pod to Persistent Volume
Run `bash` within the pod to install `gsutil` and copy the files from the bucket to the persistent volume.
```
# Run `bash` within the pod
kubectl exec -it <POD_NAME> -- /bin/bash
```
```
# Install `gsutil` and configure `gcloud` with the project
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init <PROJECT_NAME>

# Create the path within the persistent volume to transfer the files to
mkdir -p $MODELS_PATH
gsutil rsync gs://$GCS_BUCKET $MODELS_PATH
```
Now, the model files stored in the persistent volume are accessible by all current and future pods that are provisioned.
Exposing the API
The combination of a `NodePort` service, an Ingress, and a global static IP address will reliably expose the cluster publicly, with essential load balancing and SSL encryption.
Creating the `NodePort` service for the API uses the same deployment process as before but with the following `api-service.yaml` configuration file.
kubectl create -f api-service.yaml
```yaml
# api-service.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: api
  labels:
    io.kompose.service: api
spec:
  type: NodePort
  selector:
    io.kompose.service: api
  ports:
    - name: "5900"
      port: 5900
      targetPort: 5900
```
Notice the metadata and selector fields match those of the previously deployed pod, establishing the connection between the two resources.
Static IP Addressing
A new global IP address may be created to configure and expose the cluster externally.
gcloud compute addresses create my-dev-ip --global
By including the `kubernetes.io/ingress.global-static-ip-name` annotation in the `ingress-service.yaml` deployment file, the newly created global static IP address will be linked to the Ingress resource.
```yaml
# ingress-service.yaml
---
apiVersion: "extensions/v1beta1"
kind: "Ingress"
metadata:
  name: "ingress"
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "my-dev-ip"
    ingress.gcp.kubernetes.io/pre-shared-cert: "ingress-cert"
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: "api"
              servicePort: 5900
```
Naturally, the annotation with a key containing `pre-shared-cert` will specify the SSL/TLS certificate by the name of `ingress-cert`, created beforehand from the certificate and private key files:

```
# Adds the converted origin certificate
gcloud compute ssl-certificates create ingress-cert \
  --certificate ./certs/<DOMAIN>.pem \
  --private-key ./certs/<DOMAIN>.key
```
Deploying the Ingress will trigger the readiness probe hitting the health check endpoint and loading the model into memory.
```
kubectl create -f ingress-service.yaml

# Verify the exposed IP address is ready
kubectl get ingress
```
A final step would be pointing the domain's `A` records in Cloudflare to the global static IP address.
Pricing, Machine-Types, and Provisioning Resource Pools
The cost associated with operating these services is often omitted when discussing deployment approaches, yet it is imperative context for choosing optimal machine types.
Understanding Instance Billing
One of the more precarious areas to navigate is instance pricing. Running `N1` machine types with "Predefined vCPUs" costs approximately $0.038 / vCPU hour, plus $0.005 / GB hour for "Predefined Memory." By these combined measures, an `n1-standard-1` (1 vCPU and 3.75 GB memory) machine type will run at approximately $0.057 / hour.
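That figure is just the sum of the two per-resource rates. A quick sanity check, with the approximate rates above hard-coded (actual published prices vary by region and over time):

```python
VCPU_HOUR = 0.038  # approx. $ per predefined vCPU hour (N1)
GB_HOUR = 0.005    # approx. $ per predefined GB of memory per hour (N1)


def n1_hourly_cost(vcpus, memory_gb):
    """Approximate on-demand hourly cost of an N1 machine type."""
    return vcpus * VCPU_HOUR + memory_gb * GB_HOUR


# n1-standard-1: 1 vCPU, 3.75 GB
print(round(n1_hourly_cost(1, 3.75), 3))  # 0.057
```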
Monitoring the memory usage of the application running locally may establish the minimum resources necessary to provision the initial pool for sustainable use.

Given how this particular application operates, a higher system-memory-to-vCPU ratio helps the model perform more efficiently while not underutilizing paid-for CPU resources.
Creating a pool with an `n1-highmem-2` (2 vCPUs and 13 GB memory) machine type allocates above-standard memory to each node provisioned in the cluster, at approximately $0.1421 / hour.
```
gcloud container node-pools create my-pool \
  --cluster=my-cluster \
  --machine-type=n1-highmem-2 \
  --num-nodes 1 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 3
```
Note: Although untested directly, if the budget will support a node pool equipped with a GPU accelerator such as an `nvidia-tesla-t4` (1 GPU and 16 GB GDDR6 memory) at $0.95 / hour, processing time on the model may see a significant reduction when applied in conjunction with CUDA using the appropriate Theano configuration.
Migrating Resource Pools
Migrating from one resource pool to a new one requires cordoning and draining the deprecated pool, providing a seamless transition when upscaling or downscaling without incurring any downtime.
Touching on it briefly, the following two commands will accomplish this task, where `default-pool` is the name of the deprecated pool.
```
# Mark each node in the deprecated pool as unschedulable
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl cordon "$node"
done
```

```
# Drain each cordoned node, evicting its pods onto the new pool
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node"
done
```
Certainly, managed services dedicated to machine learning (e.g. Google ML Engine) are improving by the moment, but in their current state, the approach described above is believed to be one of the more reliable strategies for production deployments.
Quite a lot of ground has been covered rather quickly, but I hope to have provided insight into the decision-making process of deploying a containerized application running an ML model on Google Cloud Platform.
As always, open and welcome to questions and comments!