Last night, we published the blog post presenting the work I led
during the past few months, where we removed the need to deploy RHEL
entitlement certificates to build and deploy the GPU driver of the
NVIDIA GPU Operator. This requirement for a valid RHEL subscription
key was a major burden for OpenShift GPU computing customers, as the
key generation and deployment process couldn't be properly automated.
This work was a great collaborative effort, as it required
enhancements to multiple parts of the overall software stack:
- first at the team level, with enhancements to the Node Feature
Discovery (Eduardo Arango) and the OpenShift Driver Toolkit
container image (David Gray and Zvonko Kaiser), along with Ashish
Kamra;
- then at the project level, with core OpenShift corner-case bugs
discovered and solved, revolving around the Driver Toolkit dynamic
imagestreams;
- finally at the inter-company open-source cooperation level, with
the NVIDIA Cloud Native team (Shiva Krishna Merla) reviewing the
PRs, providing valuable feedback and spotting bugs across the
different rewrites of the logic!
Timeline of the project milestones (in 2021):
- May 27th..June 1st: the idea of using the Driver Toolkit for
entitlement-free builds arises from a Slack discussion about solving
disconnected-cluster deployment challenges. We quickly confirm that,
with minor changes, the DTK provides everything required to build
the NVIDIA driver.
- July 30th..August 11th: working POC of the GPU Operator building
the driver without entitlement, without any modification of the
operator, only a bit of configuration and manually crafted YAML
files.
- August 26th..~November 15th: intensive work to add seamless
upgrade support to the POC and get it all polished, tested and
merged into the GPU Operator.
- December 2nd: GPU Operator v1.9.0 is released, with
entitlement-free deployment enabled by default on OpenShift \o/
It's funny to see that it took only a couple of days to get the first
POC working, while integrating the seamless upgrade support took two
full months of work!
(Seamless upgrade support addresses the fact that, at a given time
during a cluster upgrade, different nodes may run different versions
of the OS. With one container image for all OS versions, no worries,
the driver deployment will work all the time; but with one container
image per OS version, that's another topic! This is covered in-depth
in the blog post.)
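To give a rough idea of what "one container image per OS version" implies, here is a minimal Go sketch, not the GPU Operator's actual code: a driver DaemonSet is pinned to the nodes running a given OS version through a node label, so a mixed-version cluster ends up with one DaemonSet per version during the upgrade. The label key, names and image are illustrative assumptions (the real operator relies on the labels exposed by the Node Feature Discovery, whose exact keys may differ).

    package main

    import (
    	"fmt"

    	appsv1 "k8s.io/api/apps/v1"
    	corev1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // osVersionLabel is an example NFD label exposing the node's OS version.
    // The key actually used by the GPU Operator may differ (assumption).
    const osVersionLabel = "feature.node.kubernetes.io/system-os_release.VERSION_ID"

    // driverDaemonSetFor builds a per-OS-version driver DaemonSet: the image
    // is built for exactly one OS version, and the nodeSelector restricts the
    // pods to nodes running that version.
    func driverDaemonSetFor(osVersion, image string) *appsv1.DaemonSet {
    	labels := map[string]string{"app": "nvidia-driver-" + osVersion}
    	return &appsv1.DaemonSet{
    		ObjectMeta: metav1.ObjectMeta{Name: "nvidia-driver-" + osVersion},
    		Spec: appsv1.DaemonSetSpec{
    			Selector: &metav1.LabelSelector{MatchLabels: labels},
    			Template: corev1.PodTemplateSpec{
    				ObjectMeta: metav1.ObjectMeta{Labels: labels},
    				Spec: corev1.PodSpec{
    					// Only nodes whose OS version matches run this image.
    					NodeSelector: map[string]string{osVersionLabel: osVersion},
    					Containers: []corev1.Container{
    						{Name: "nvidia-driver", Image: image},
    					},
    				},
    			},
    		},
    	}
    }

    func main() {
    	// Hypothetical image reference, just to show the per-version pairing.
    	ds := driverDaemonSetFor("8.4", "example.registry/nvidia-driver:rhel-8.4")
    	fmt.Println(ds.Name, ds.Spec.Template.Spec.NodeSelector)
    }

The point of the sketch is only the pairing: one image built against one OS version, matched to the nodes that run it, which is why the operator has to create and reconcile several of these objects while an upgrade is in flight.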