PatMigliaccio

I'm Pat, a software developer. You'll find a collection of articles I've written as well as open-source code I've worked on here.

Lessons from Deploying a Machine Learning Model with Kubernetes to Google Cloud Platform

October 20, 2019

Deployments can be hard, especially when working with resource-intensive processes like machine learning models and trying to ensure they run optimally. After working through quite a few challenges, I put together the following illustration of the steps taken to deploy a containerized model to Google Cloud Platform (GCP) in a relatively framework-agnostic manner using Kubernetes.

A Quick Brief

In the case outlined, the model and associated API are written in Python with the Theano library and served using the Flask web application framework. The model itself requires a few large asset files to be loaded into memory for it to encode and process the input data.

This article leaves most of the steps for deploying a standard container with Google Kubernetes Engine (GKE) to the official documentation and focuses instead on techniques for overcoming the major pain points in the deployment process. These techniques are organized below by area of concern rather than as sequential steps.

Preparing the Application

Before deploying the container to GCP, two functions must be added to the Flask application to ensure the model is loaded at startup, that is, while the pod is being provisioned and before its Kubernetes health checks pass.

The first function, _load_model(), is triggered before the first request made to the API is handled. In this example, that first request is the readiness-probe health check on the /healthz endpoint, which the Kubernetes Ingress, discussed later, relies on.

...

@app.before_first_request
def _load_model():
  global MODEL
  # Note: In GKE, health check triggers loading of the model

  MODEL = load_model(_logger=app.logger)

@app.route('/healthz', methods=['GET'])
def health():
  """Health check probe route for Kubernetes Ingress"""
  return jsonify(code='200', message='Ok')

...

By storing the model returned from the load_model function in a global variable, all subsequent requests to this pod use the copy already loaded in memory, reducing request times to the API by approximately 60 seconds.

Storing the Model Assets

Since the load_model function reads in a few sizeable Pickle (.pkl) files (approx. 2.5 GB) used for encoding the input data, the application must be granted file system access to these assets.
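The body of load_model is not shown in this article, so the following is only a sketch of what reading such assets from a directory could look like. The function name load_assets and the one-dict-entry-per-file layout are assumptions made for illustration.

```python
import os
import pickle


def load_assets(models_path):
    """Load every .pkl file found directly under models_path into a dict.

    Keys are file names without the extension. This layout is an assumption;
    the article does not show how load_model organizes its assets.
    """
    assets = {}
    for file_name in sorted(os.listdir(models_path)):
        if not file_name.endswith(".pkl"):
            continue
        with open(os.path.join(models_path, file_name), "rb") as handle:
            # Only unpickle files you trust: pickle can execute arbitrary code.
            assets[os.path.splitext(file_name)[0]] = pickle.load(handle)
    return assets
```

In the deployment described later, the directory would come from the MODELS_PATH environment variable, for example load_assets(os.environ["MODELS_PATH"]).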

Earlier storage setups included the following two schemes, but neither provided a workable solution.

  • Attempting to mount a Google Cloud Storage (GCS) bucket containing the files from within the Dockerfile using gcsfuse resulted in a connectivity nightmare.

  • Downloading the same model assets on application start using the Python library for Cloud Storage was sluggish, provided limited reliability, and was required with each restart or provisioning of a new pod.

Persistent Volumes, Docker exec, and gsutil

After a bit of trial and error, the solution that worked was to mount a persistent volume in Kubernetes, use kubectl exec (the Kubernetes counterpart to docker exec) to run gsutil within the container, and manually conduct a one-time transfer of the model files from the bucket to the volume.

Persistent Volumes

Provisioning a persistent volume requires creating a PersistentVolumeClaim before creating the pod that will eventually use it. With kubectl used to deploy new resources, running the command below against the following pvc.yaml configuration file accomplishes this.

kubectl create -f pvc.yaml

# pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-disk
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi

Pod Deployment

Creating a pod for the container application within the Kubernetes cluster follows the same provisioning step but against a separate api-deployment.yaml configuration file.

Some setups may benefit from a single configuration file for all objects. Here, they are split into separate files for clarity.

Generated initially from a docker-compose.yaml file using kompose, then modified and slightly trimmed, the example declares:

  • Labels and selectors to point the NodePort service, to be defined later, to the new Deployment pod
  • Environment variables representing
    • the name of the bucket containing the model files
    • the path to which the model files will be copied and from which they will later be read
  • A persistent volume claim and the mountPath at which to mount the volume
  • The resources to be allocated to each pod
  • A route for the readiness probe against which GKE will perform its health checks

…along with additional, more standard deployment configuration settings.

kubectl create -f api-deployment.yaml

# api-deployment.yaml
---
# apps/v1 replaces extensions/v1beta1, which stopped serving Deployments in Kubernetes 1.16
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
spec:
  selector:
    matchLabels:
      io.kompose.service: api
  replicas: 1
  template:
    metadata:
      labels:
        io.kompose.service: api
    spec:
      containers:
      - name: my-api
        image: gcr.io/my-project/my-image
        ports:
          - containerPort: 5900
        env:
        - name: GCS_BUCKET
          value: my-models
        - name: MODELS_PATH
          value: /mnt/disk/models/
        volumeMounts:
        - mountPath: "/mnt/disk"
          name: my-disk
        resources:
          requests:
            memory: "8096Mi"
            cpu: "500m"
          limits:
            memory: "16384Mi"
            cpu: "1000m"
        readinessProbe:
          httpGet:
            path: /healthz
            port: 5900
      volumes:
      - name: my-disk
        persistentVolumeClaim:
          claimName: my-disk
      restartPolicy: Always

CUDA-Enabled Container Image

As an additional note, the Dockerfile used for building my-image is a derivative of the nvidia/cuda:9.0-cudnn7-runtime container image with Python and the associated package requirements installed.

Transferring the Model Files

With the initial pod and volume created, transferring the models, currently stored locally, to the GCS bucket and then to the persistent volume can all be done with the following sets of commands.

Local Machine to Bucket

Recursively copy the large files, using the -m flag to run the transfer in parallel for efficiency.

gsutil -m cp -r models gs://my-models

Get Pod Name

Retrieve the name of the created pod in which to run the transfer.

kubectl get pods

Bucket Through Pod to Persistent Volume

Run bash within the pod to install gsutil and copy the files from the bucket to the persistent volume.

# Run `bash` within pod
kubectl exec -it <POD_NAME> -- /bin/bash

# Install `gsutil` and configure `gcloud` with the project
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
# When prompted by `gcloud init`, select <PROJECT_NAME>

# Create the path within the persistent volume to transfer the files to
mkdir -p "$MODELS_PATH"
# -r makes rsync recurse into subdirectories, which it does not do by default
gsutil rsync -r "gs://$GCS_BUCKET" "$MODELS_PATH"

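Now the model files stored in the persistent volume are available to any current or future pod that mounts the claim. Note that with the ReadWriteOnce access mode used above, the volume can be attached read-write to only one node at a time, so pods sharing it need to run on the same node.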

Exposing the API

The combination of a NodePort service, Ingress service, and global static IP will reliably expose the cluster publicly, with essential load balancing and SSL encryption.

NodePort Deployment

Creating a NodePort service for the API uses the same deployment process as before, but with the following api-service.yaml configuration file.

kubectl create -f api-service.yaml

# api-service.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: api
  labels:
    io.kompose.service: api
spec:
  type: NodePort
  selector:
    io.kompose.service: api
  ports:
  - name: "5900"
    port: 5900
    targetPort: 5900

Notice that the labels and selector fields match those of the previously deployed pod, which establishes the connection between the two resources.
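As a rough illustration of why the matching values matter, the sketch below models an equality-based selector as a plain dictionary comparison. It is a simplification written for this article, not Kubernetes code, and the "worker" label is a hypothetical second workload.

```python
def selector_matches(selector, labels):
    """Return True if every key/value pair in selector also appears in labels."""
    return all(
        key in labels and labels[key] == value
        for key, value in selector.items()
    )


service_selector = {"io.kompose.service": "api"}
pod_labels = {"io.kompose.service": "api"}
other_labels = {"io.kompose.service": "worker"}  # hypothetical second workload

print(selector_matches(service_selector, pod_labels))    # True
print(selector_matches(service_selector, other_labels))  # False
```

A service whose selector does not match any pod labels simply has no pods to route traffic to, which is why a typo in either file tends to show up as an unreachable API.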

Static IP Addressing

A new global static IP address can be created to expose the cluster to the outside.

gcloud compute addresses create my-dev-ip --global

Ingress Deployment

Including the kubernetes.io/ingress.global-static-ip-name annotation in the ingress-service.yaml configuration file links the newly created global static IP address to the Ingress resource.

# ingress-service.yaml
---
# networking.k8s.io/v1 replaces extensions/v1beta1, which was removed for Ingress in Kubernetes 1.22
apiVersion: "networking.k8s.io/v1"
kind: "Ingress"
metadata:
  name: "ingress"
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "my-dev-ip"
    ingress.gcp.kubernetes.io/pre-shared-cert: "ingress-cert"
spec:
  rules:
    - http:
        paths:
        - path: /*
          pathType: ImplementationSpecific
          backend:
            service:
              name: "api"
              port:
                number: 5900

The ingress.gcp.kubernetes.io/pre-shared-cert annotation specifies the SSL/TLS certificate to use, here one named ingress-cert.

If Cloudflare happens to be the DNS provider of choice, an Origin CA certificate can be created, exported, converted using OpenSSL, and then added to GCP under the same ingress-cert name.

# Adds the converted origin certificate
gcloud compute ssl-certificates create ingress-cert \
  --certificate ./certs/<DOMAIN>.pem \
  --private-key ./certs/<DOMAIN>.key

Deploying the Ingress triggers health checks against the readiness-probe endpoint, which in turn loads the model into memory.

kubectl create -f ingress-service.yaml

# Verify exposed IP address is ready
kubectl get ingress
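Once an address is listed, it can still take a while before the endpoint answers. The following is a small sketch of polling the health check route until it responds. The function names, retry counts, and delays are illustrative assumptions, and the URL passed in would need to be replaced with the real domain or IP address.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=5):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx status.
        return False


def wait_until_ready(check, attempts=30, delay=10):
    """Call check() until it returns True.

    Returns the attempt number that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    return None
```

For example, wait_until_ready(lambda: is_healthy("https://<DOMAIN>/healthz")) would block until the route defined earlier returns successfully or the attempts run out.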

The final step is to point the domain’s A records in Cloudflare to the global static IP address.

Pricing, Machine-Types, and Provisioning Resource Pools

The cost of operating these services is often omitted when discussing deployment approaches, yet it is essential context for choosing the most suitable machine-types.

Understanding Instance Billing

One of the more precarious areas to navigate is the instance pricing. Running N1 machine-types with “Predefined vCPUs” costs approximately $0.038 / vCPU hour and $0.005 / GB hour for “Predefined Memory.” By these combined measures, an n1-standard-1 (1 vCPU and 3.75 GB) machine-type will run at approximately $0.057 / hour.
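The arithmetic behind that figure can be written out as a short sketch. The rates are the approximate ones quoted above. The 730-hour month is an assumption for a rough monthly estimate, and the calculation ignores discounts, disks, and networking charges.

```python
VCPU_RATE = 0.038    # USD per vCPU hour, as quoted above
MEMORY_RATE = 0.005  # USD per GB hour, as quoted above


def hourly_cost(vcpus, memory_gb):
    """Estimate the hourly price of a predefined N1 machine shape."""
    return vcpus * VCPU_RATE + memory_gb * MEMORY_RATE


def monthly_cost(vcpus, memory_gb, hours=730):
    """Estimate a month of continuous use; 730 hours is an assumed average month."""
    return hourly_cost(vcpus, memory_gb) * hours


n1_standard_1 = hourly_cost(vcpus=1, memory_gb=3.75)
print(f"n1-standard-1: ${n1_standard_1:.3f} / hour")
print(f"n1-standard-1: ${monthly_cost(1, 3.75):.2f} / month (estimate)")
```

The same function can be applied to any other vCPU and memory combination to compare candidate machine-types before creating a pool.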

High-Memory Machine-Types

Monitoring the memory usage of the application while it runs locally can establish the minimum resources needed to provision the initial pool for sustainable use.

Given how this particular example application operates, a higher system-memory-to-vCPU ratio helps the model perform more efficiently without underutilizing paid-for CPU resources.

Creating a pool with the n1-highmem-2 machine-type (2 vCPU and 13 GB memory) allocates above-standard memory to each node provisioned in the cluster, at approximately $0.1421 / hour.
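One way to take such a measurement from within Python is the standard library's tracemalloc module, sketched below. The helper name is hypothetical. tracemalloc only sees allocations made through Python's own allocators, so memory held by native libraries may be undercounted and an OS-level tool remains the better guide for the whole process.

```python
import tracemalloc


def peak_python_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak MiB traced while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)
```

Wrapping the model loading step, for example peak_python_memory_mib(load_model), would give a first lower bound for the memory request of each pod.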

gcloud container node-pools create my-pool \
  --cluster=my-cluster \
  --machine-type=n1-highmem-2 \
  --num-nodes 1 --enable-autoscaling \
  --min-nodes 1 --max-nodes 3

Note: Although untested directly, if the budget will support a node pool equipped with a GPU accelerator such as nvidia-tesla-t4 (1 GPU and 16 GB GDDR6 memory) at $0.95 / hour, processing time on the model may see a significant reduction when applied in conjunction with CUDA using the appropriate Theano configuration (.theanorc).
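Besides a .theanorc file, Theano also reads its configuration from the THEANO_FLAGS environment variable. The sketch below shows that route. Like the note above, it is untested against this model, and the device and floatX values are assumptions that would need to match the installed GPU backend.

```python
import os


def theano_flags(device="cuda", float_type="float32"):
    """Build a THEANO_FLAGS value; the defaults are assumptions, not tested settings."""
    return f"device={device},floatX={float_type}"


# Must be set before `import theano`, which reads the flags at import time.
# setdefault leaves any value already provided by the environment untouched.
os.environ.setdefault("THEANO_FLAGS", theano_flags())
```

In a Kubernetes deployment, the same value could instead be supplied through the env section of the container spec, keeping the application code unchanged.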

Migrating Resource Pools

Migrating from one resource pool to a new one requires cordoning and evicting the deprecated pool, which provides a seamless transition when upscaling or downscaling without incurring any downtime.

Touching on it briefly, the following two sets of commands accomplish this task, where default-pool is the name of the deprecated pool.

Cordon Pool

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl cordon "$node";
done

Evict Pool

# Newer kubectl releases rename --delete-local-data to --delete-emptydir-data
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node";
done

Other Solutions?

Managed services dedicated to machine learning (e.g. Google ML Engine) are certainly improving by the moment, but in their current state, I believe the approach above to be one of the few reliable strategies for production deployments.

Quite a lot of ground has been covered rather quickly, but I hope to have provided insight into the decision-making process of deploying a containerized application running an ML model on Google Cloud Platform.

As always, I'm open to and welcome questions and comments!

Photo by Scoot Johnson