In this guide, I will focus on using consumer-grade graphics cards, such as NVIDIA's GTX and RTX series, for container workloads on Kubernetes. If you have read through my previous posts, you know I am migrating the services I used to host on plain old docker-compose to Kubernetes. One such workload was Jellyfin, which was hobbling along whenever transcoding kicked in on the puny Intel Iris Plus 655 integrated graphics. When I bought new hardware for the new cluster, I picked up an NVIDIA T400 to go along with it. Although not squarely a consumer-grade card, it is treated like one: there is no ESXi support for GPU virtualization or anything special like that.
I started by installing a Photon OS node to host the graphics card, passed the card through from ESXi, and joined the node to the k3s cluster. Now come the tricky parts. The overall steps are: install the drivers on the VM, get the container runtime set up, use that runtime for workloads, and finally tell Kubernetes that the node has a GPU resource via the device plugin. Let's get started, shall we?
Most of the steps needed are in the issue vmware/photon#1291 that I opened while troubleshooting this on Photon OS. Other operating systems follow a similar method, and unlike Photon OS, there are plenty of guides out there for the common distros. Here is the TL;DR:
# Get kernel sources
❯ tdnf install linux-esx-devel
❯ reboot

# Get build tools and other needed packages
❯ tdnf install build-essential tar wget

# Download the right driver for your card from https://www.nvidia.com/download/index.aspx
❯ wget https://us.download.nvidia.com/XFree86/Linux-x86_64/510.54/NVIDIA-Linux-x86_64-510.54.run

# Unmount /tmp to avoid tmpfs, which runs out of space during the driver install
❯ umount /tmp

# Run the installer. Select OK to guess the X library path (we don't have X),
# OK to no 32-bit install, and No to nvidia-xconfig.
# If asked for DKMS, answer No until this issue is resolved:
# https://github.com/vmware/photon/issues/1287
❯ sh NVIDIA-Linux-x86_64-510.54.run

# Reboot to pick up any necessary changes and revert /tmp to tmpfs
❯ reboot
# Check if the graphics card was detected, sample output below
❯ nvidia-smi
Thu Feb 17 21:55:41 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T400        Off   | 00000000:0B:00.0 Off |                  N/A |
| 34%   44C    P0    N/A /  31W |      0MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
# Add the NVIDIA container toolkit repository
❯ curl -s -L https://nvidia.github.io/nvidia-docker/centos8/nvidia-docker.repo | tee /etc/yum.repos.d/nvidia-docker.repo

# Note that the package is nvidia-container-toolkit, which replaces nvidia-container-runtime.
# Another point of confusion right here.
❯ tdnf install nvidia-container-toolkit
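If the install pulled in libnvidia-container-tools (a dependency of the toolkit), nvidia-container-cli gives a quick sanity check that the container tooling can see the driver:

# Should report the driver version, CUDA version, and the T400
❯ nvidia-container-cli info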
Installing the above package “should” also configure containerd for the nvidia runtime. Ensure that the containerd configuration file contains a runtime entry like the one below.
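On a k3s node, the file in question is the config that k3s generates at /var/lib/rancher/k3s/agent/etc/containerd/config.toml; recent k3s releases add the nvidia runtime on their own once nvidia-container-runtime is found on the node (a restart of the k3s service may be needed before it gets picked up). The entry should look roughly like this sketch, though the exact plugin section names vary between containerd versions:

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  # Path where the NVIDIA container toolkit installs its runtime shim
  BinaryName = "/usr/bin/nvidia-container-runtime"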
Now that we have a runtime that supports NVIDIA, we need to inform Kubernetes (k3s) about it. There is more information in the issue k3s-io/k3s#4070; I ended up getting most of the needed information from there. Apply the YAML below using kubectl.
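What gets applied is a RuntimeClass that maps the name nvidia to the containerd runtime configured above; a minimal version looks like this (the file name is just my own choice):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia        # the name workloads will reference via runtimeClassName
handler: nvidia       # must match the runtime key in the containerd config

❯ kubectl apply -f nvidia-runtime-class.yaml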
Kubernetes provides resources to pods; the cpu and memory resource types are built in. To expose a GPU as a resource, we need to advertise it using a device plugin. The plugin inspects the node and reports its GPUs to the kube-apiserver, which can then hand them out when workloads request them. NVIDIA provides an official device plugin as an open-source project. We can install the latest release (as of writing this) using the command below.
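The project README has the canonical manifest URL for each release; the install boils down to something like the following, with vX.Y.Z standing in for whatever release is current when you read this:

# Check https://github.com/NVIDIA/k8s-device-plugin for the current release tag and manifest path
❯ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.Y.Z/nvidia-device-plugin.yml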
We need to modify the device plugin DaemonSet so it uses the nvidia runtime and only runs on nodes that actually have that runtime. Otherwise, its pods get stuck in ContainerCreating.
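Roughly, that means adding the following to the DaemonSet's pod template; the gpu: "true" node label is my own convention, so substitute whatever label you use to mark the GPU node:

spec:
  template:
    spec:
      runtimeClassName: nvidia   # run the plugin itself under the nvidia runtime
      nodeSelector:
        gpu: "true"              # only schedule on the node that has the card

# Label the GPU node accordingly (node name is an example)
❯ kubectl label node gpu-node-01 gpu=true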
To run GPU workloads, remember to use the following in the pod spec. Here is a sample Jellyfin deployment
that I use.
spec:
  runtimeClassName: nvidia            # Specify nvidia as the runtimeClass
  containers:
    - name: graphicWorkload
      image: nvidiaWorkloadImage      # NVIDIA compatible image
      env:
        # This is "supposed" to be injected by the device plugin
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
      resources:
        limits:
          nvidia.com/gpu: 1           # Request a GPU
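Once the device plugin and the workload are running, a quick way to confirm everything is wired up (node and pod names below are placeholders for your own):

# The GPU should show up as an allocatable resource on the node
❯ kubectl describe node <gpu-node> | grep nvidia.com/gpu

# The workload should see the card through the injected driver libraries
❯ kubectl exec -it <jellyfin-pod> -- nvidia-smi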
Because we answered No to DKMS while installing the drivers in the first step, an upgrade of the Linux kernel necessitates a reinstall of the driver.
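In practice that means repeating the relevant part of the install, along these lines:

# Upgrade the kernel and its matching sources, then boot into the new kernel
❯ tdnf update linux-esx linux-esx-devel
❯ reboot
# Re-run the driver installer against the new kernel
❯ umount /tmp
❯ sh NVIDIA-Linux-x86_64-510.54.run
❯ reboot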
The whole process of getting this set up took me the better part of a week. I'm pretty sure I'll be using this as a guide myself on my next install of a k3s cluster with NVIDIA graphics. I hope this helps you too!