Lessons from Deploying a Machine Learning Model with Kubernetes to Google Cloud Platform
Deployments can be hard, especially when working with resource-intensive processes like machine learning models and ensuring they run optimally. After quite a few challenges, the following illustrates a handful of the actions taken to deploy a containerized model to Google Cloud Platform (GCP) in a relatively framework-agnostic manner using Kubernetes.
A Quick Brief
In the case outlined, the model and associated API are written in Python with the Theano library and served using the Flask web application framework. The model itself requires a few large asset files to be loaded into memory for it to encode and process the input data.
Since the official documentation already covers most of the steps for deploying a standard container with Google Kubernetes Engine (GKE), this article focuses on the techniques used to overcome the major pain points of the deployment process. These techniques are organized below by area of concern rather than as sequential steps.
Preparing the Application
Before deploying the container to GCP, two methods, or routes, must be added to the Flask application to ensure the model is loaded on startup, that is, when the pod is provisioned and all Kubernetes health checks pass.
The first method, _load_model(), is triggered prior to the execution of the first request made to the API. In this example, the Kubernetes Ingress, discussed later, will perform a readiness probe health check against the /healthz endpoint.
...

@app.before_first_request
def _load_model():
    global MODEL
    # Note: In GKE, health check triggers loading of the model
    MODEL = load_model(_logger=app.logger)


@app.route('/healthz', methods=['GET'])
def health():
    """Health check probe route for Kubernetes Ingress"""
    return jsonify(code='200', message='Ok')

...
By storing the model returned from the load_model method in a global variable, all subsequent requests to this pod will use the copy already loaded in memory, reducing request times to the API by approximately 60 seconds.
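For context, the sketch below shows roughly what a load_model helper along these lines might look like, assuming the pickled assets described in the next section live under a directory given by the MODELS_PATH environment variable (defined later in the deployment). The file names and the returned structure are hypothetical placeholders, not the actual implementation.

import os
import pickle


def load_model(_logger=None):
    """Read the pickled assets from the mounted volume and return them ready for use.

    Assumes MODELS_PATH points at the directory populated from the GCS bucket;
    the file names below are placeholders for the real assets.
    """
    models_path = os.environ.get('MODELS_PATH', '/mnt/disk/models/')
    if _logger:
        _logger.info('Loading model assets from %s', models_path)

    with open(os.path.join(models_path, 'encoder.pkl'), 'rb') as f:
        encoder = pickle.load(f)
    with open(os.path.join(models_path, 'weights.pkl'), 'rb') as f:
        weights = pickle.load(f)

    # Placeholder for however the Theano graph is actually assembled from the assets
    return {'encoder': encoder, 'weights': weights}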
Storing the Model Assets
Since the load_model method reads in a few sizeable Pickle (.pkl) files (approx. 2.5 GB) used for encoding the input data, the application requires file system access to these assets.
Earlier storage setups included the following schemes, yet neither provided a workable solution.
- Attempting to mount a Google Cloud Storage (GCS) bucket containing the files from within the Dockerfile using gcsfuse resulted in a connectivity nightmare.
- Downloading the same model assets on application start using the Python library for Cloud Storage was sluggish, provided limited reliability, and had to be repeated with each restart or provisioning of a new pod.
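For reference, the download-on-startup attempt looked roughly like the sketch below, using the google-cloud-storage client library. The bucket name mirrors the one used later in this article, and the destination directory and function name are illustrative.

import os
from google.cloud import storage  # pip install google-cloud-storage


def download_assets(bucket_name='my-models', dest='/tmp/models'):
    """Pull every object in the bucket down to local disk when the app starts.

    Functional, but with ~2.5 GB of assets it added minutes to every pod start.
    """
    os.makedirs(dest, exist_ok=True)
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for blob in bucket.list_blobs():
        if blob.name.endswith('/'):
            continue  # skip directory placeholder objects
        target = os.path.join(dest, os.path.basename(blob.name))
        blob.download_to_filename(target)
    return dest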
Persistent Volumes, kubectl exec, and gsutil
After a bit of trial and error, the remedying solution involved mounting a persistent volume in Kubernetes, using kubectl exec to run gsutil within the container, and manually conducting a one-time transfer of the model files from the bucket to the volume.
Persistent Volumes
Provisioning a persistent volume requires creating a PersistentVolumeClaim before creating the pod that will eventually use it. Using kubectl to deploy new resources, running the command below against the following pvc.yaml configuration file will accomplish the desired result.
kubectl create -f pvc.yaml
# pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-disk
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
Pod Deployment
Creating a pod for the container application within the Kubernetes cluster follows the same provisioning step but against a separate api-deployment.yaml configuration file.
Some configuration setups may benefit from using a single deployment file for all objects; in the instance discussed, they are split into separate files for clarity.
Generated initially from a docker-compose.yaml file using kompose, the further modified and slightly reduced example declares:
- Labels and selectors to point the NodePort service, to be defined later, to the new Deployment pod
- Environment variables representing:
  - the bucket name containing the model files
  - the path to which the model files will be copied and later read from
- A persistent volume claim and the mountPath at which to mount said volume
- The resources to be allocated to each pod
- A route for the readiness probe on which GKE will perform the health checks
…along with additional, more standard deployment configuration settings.
kubectl create -f api-deployment.yaml
# api-deployment.yaml
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: api-deployment
spec:
  selector:
    matchLabels:
      io.kompose.service: api
  replicas: 1
  template:
    metadata:
      labels:
        io.kompose.service: api
    spec:
      containers:
        - name: my-api
          image: gcr.io/my-project/my-image
          ports:
            - containerPort: 5900
          env:
            - name: GCS_BUCKET
              value: my-models
            - name: MODELS_PATH
              value: /mnt/disk/models/
          volumeMounts:
            - mountPath: "/mnt/disk"
              name: my-disk
          resources:
            requests:
              memory: "8096Mi"
              cpu: "500m"
            limits:
              memory: "16384Mi"
              cpu: "1000m"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 5900
      volumes:
        - name: my-disk
          persistentVolumeClaim:
            claimName: my-disk
      restartPolicy: Always
CUDA-Enabled Container Image
As an additional note, the Dockerfile used for building my-image is a derivative of the nvidia/cuda:9.0-cudnn7-runtime container image with Python and the associated package requirements installed.
Transferring the Model Files
With the initial pod and volume created, transferring the models, currently stored locally, to the GCS bucket and then to the persistent volume can be done with the following sets of commands.
Local Machine to Bucket
Recursively copying the large files in a multithreaded manner to ensure the most efficient transfer.
gsutil -m cp -r models gs://my-models
Get Pod Name
Retrieving the name of the created pod in which to execute the transfer.
kubectl get pods
Bucket Through Pod to Persistent Volume
Running bash within the pod to install gsutil and copy the files from the bucket to the persistent volume.
# Run `bash` within pod
kubectl exec -it <POD_NAME> -- /bin/bash
# Install `gsutil` and configure `gcloud` with the project
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
<PROJECT_NAME>
# Create the path within the persistent volume to transfer the files to
mkdir -p $MODELS_PATH
gsutil rsync gs://$GCS_BUCKET $MODELS_PATH
Now, the model files stored in the persistent volume are accessible by all current and future pods that are provisioned.
Exposing the API
The combination of a NodePort service, an Ingress, and a global static IP address will reliably expose the cluster publicly, with essential load balancing and SSL encryption.
NodePort Deployment
Creating a NodePort service for the API uses the same deployment process as before but with the following api-service.yaml configuration file.
kubectl create -f api-service.yaml
# api-service.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: api
  labels:
    io.kompose.service: api
spec:
  type: NodePort
  selector:
    io.kompose.service: api
  ports:
    - name: "5900"
      port: 5900
      targetPort: 5900
Notice that the metadata and selector fields match those of the previously deployed pod, establishing a connection between the two resources.
Static IP Addressing
A new global static IP address can be created to configure and expose the cluster to the outside world.
gcloud compute addresses create my-dev-ip --global
Ingress Deployment
By including the kubernetes.io/ingress.global-static-ip-name annotation in the ingress-service.yaml deployment file, the newly created global static IP address will be linked to the Ingress resource.
# ingress-service.yaml
---
apiVersion: "extensions/v1beta1"
kind: "Ingress"
metadata:
  name: "ingress"
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "my-dev-ip"
    ingress.gcp.kubernetes.io/pre-shared-cert: "ingress-cert"
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: "api"
              servicePort: 5900
Naturally, the annotation with a key containing pre-shared-cert will specify the SSL/TLS certificate by the name of ingress-cert.
If Cloudflare happens to be the DNS provider of choice, an Origin CA certificate can be created, exported, converted using OpenSSL, and then added to GCP under the same ingress-cert name.
# Add the converted origin certificate
gcloud compute ssl-certificates create ingress-cert \
  --certificate ./certs/<DOMAIN>.pem \
  --private-key ./certs/<DOMAIN>.key
Deploying the Ingress will trigger the readiness probe hitting the health check endpoint and loading the model into memory.
kubectl create -f ingress-service.yaml
# Verify exposed IP address is ready
kubectl get ingress
A final step is to point the domain's A records in Cloudflare to the global static IP address.
Pricing, Machine-Types, and Provisioning Resource Pools
The cost associated with operating these services is often omitted when discussing deployment approaches, yet it is essential context for choosing the optimal machine-types.
Understanding Instance Billing
One of the more precarious areas to navigate is the instance pricing. Running N1 machine-types with “Predefined vCPUs” costs approximately $0.038 / vCPU hour and $0.005 / GB hour for “Predefined Memory.” By these combined measures, an n1-standard-1 (1 vCPU and 3.75 GB) machine-type will run at approximately $0.057 / hour.
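As a quick sanity check on those figures, the hourly estimate is simply the sum of the vCPU and memory components:

# Back-of-the-envelope check of the quoted N1 on-demand rates
VCPU_HOUR = 0.038    # $ per predefined vCPU hour
MEM_GB_HOUR = 0.005  # $ per predefined GB of memory hour


def hourly_cost(vcpus, memory_gb):
    return vcpus * VCPU_HOUR + memory_gb * MEM_GB_HOUR


# n1-standard-1: 1 vCPU and 3.75 GB of memory
print(round(hourly_cost(1, 3.75), 3))  # ~0.057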
High-Memory Machine-Types
Monitoring the memory usage of the application running locally can establish the minimum resources necessary to provision the initial pool for sustainable use.
Given how this particular example application operates, a higher system-memory-to-vCPU ratio helps the model perform more efficiently while not underutilizing paid-for CPU resources.
By creating a pool with an n1-highmem-2 (2 vCPU and 13 GB memory) machine-type, above-standard memory is allocated to each node provisioned in the cluster, running at approximately $0.1421 / hour.
gcloud container node-pools create my-pool \
  --cluster=my-cluster \
  --machine-type=n1-highmem-2 \
  --num-nodes 1 --enable-autoscaling \
  --min-nodes 1 --max-nodes 3
Note: Although untested directly, if the budget will support a node pool equipped with a GPU accelerator such as nvidia-tesla-t4 (1 GPU and 16 GB GDDR6 memory) at $0.95 / hour, processing time on the model may see a significant reduction when used in conjunction with CUDA and the appropriate Theano configuration (.theanorc).
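As a rough illustration of that last point, Theano's GPU backend can be enabled through a .theanorc file or, equivalently, through the THEANO_FLAGS environment variable set before theano is first imported; the exact flags below are an assumption and depend on the Theano and CUDA versions in the image.

import os

# Assumed configuration, equivalent to setting device/floatX in ~/.theanorc.
# Must be set before the first `import theano` anywhere in the process.
os.environ.setdefault('THEANO_FLAGS', 'device=cuda,floatX=float32')

import theano  # noqa: E402  (imported after configuring the flags on purpose)
print(theano.config.device)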
Migrating Resource Pools
Migrating from one resource pool to a new one requires cordoning and evicting the deprecated pool, providing a seamless transition when upscaling or downscaling without incurring any downtime.
Touching on it briefly, the following two commands will accomplish this task, where default-pool is the name of the deprecated pool.
Cordon Pool
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl cordon "$node";
done
Evict Pool
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node";
done
Other Solutions?
Certainly, managed services dedicated to machine learning (e.g. Google ML Engine) are improving by the moment, but in their current state, the approach described above is believed to be one of the few reliable strategies for production deployments.
Quite a lot of ground has been covered rather quickly, but I hope to have provided insight into the decision-making process of deploying a containerized application running an ML model on Google Cloud Platform.
As always, questions and comments are open and welcome!