Network performance in distributed training: Maximizing GPU utilization on OpenShift

in RHOAI, PSAP, Work | blog-post

Today, my teammates Tanya Osokin and Michey Mehta and I published the results of our multi-node distributed training investigation. We benchmarked the scale-up of OpenShift clusters with 8xH100 or 2xL40s GPUs and showed the utmost importance of the network link interconnecting the cluster nodes.

With the processing speed of the 8xH100 cluster, a 400 Gbps network was necessary to improve the training throughput, while for the 2xL40s cluster, a 200 Gbps network was enough.

OpenShift L40s multi-node training

Reach native speed with macOS llama.cpp container inference

in CRC, Work | blog-post

Today we’ve published my exploratory work to get GPU inference on Podman Desktop/AI Lab and RamaLama up to native speed on macOS! 🎉🎉🎉

On this platform (as on Windows), containers run inside Linux virtual machines. While the CPU has always run at full speed, GPU acceleration isn’t as straightforward.

The solution currently available is general-purpose: it lets the Vulkan API jump out of the VM isolation, but it is limited to 75-80% of llama.cpp’s native performance (see this post: https://lnkd.in/eMhm8SnB).

With this POC, we propose instead to cross the VM boundary at llama.cpp’s GGML interface and let the ggml-metal backend interact with the GPU. This choice paid off, as we could demonstrate near-native performance across a range of inference configurations 🎉.
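
For the curious, here is a toy sketch of the API-remoting idea, not the actual POC code (the real implementation forwards GGML calls through the VM’s transport layer): a guest-side “client” ships one high-level compute request to a host-side “server” that owns the accelerator, instead of tunnelling low-level GPU commands across the boundary. Everything in this snippet (the socket transport, the `dot` operation) is hypothetical and only illustrates the principle.

```python
# Toy illustration only (hypothetical code, not the actual POC): the "guest"
# forwards a high-level compute request to the "host", which owns the
# accelerator, instead of tunnelling low-level GPU commands across the boundary.
import json
import socket
import threading


def host_side(server_sock: socket.socket) -> None:
    """Host side: accept one connection and execute the requested operation."""
    conn, _ = server_sock.accept()
    with conn:
        request = json.loads(conn.recv(4096).decode())
        if request["op"] == "dot":  # the computation runs on the host
            result = sum(a * b for a, b in zip(request["x"], request["y"]))
            conn.sendall(json.dumps({"result": result}).encode())


def guest_side(address) -> float:
    """Guest side: remote the call instead of computing locally."""
    with socket.create_connection(address) as conn:
        conn.sendall(json.dumps({"op": "dot", "x": [1, 2, 3], "y": [4, 5, 6]}).encode())
        return json.loads(conn.recv(4096).decode())["result"]


if __name__ == "__main__":
    server = socket.create_server(("127.0.0.1", 0))
    threading.Thread(target=host_side, args=(server,), daemon=True).start()
    print(guest_side(server.getsockname()))  # prints 32, computed "host side"
```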

Take a look at the blog post to see how to test it with Podman Desktop and RamaLama. And of course, everything is open source 😀

Benchmark of llama.cpp API Remoting

How we improved AI inference on macOS Podman containers

in RHOAI, CRC, Work | blog-post

Today we published the results of my macOS container performance evaluation work, where I reviewed how the recent enhancements of llama.cpp/Vulkan and libkrun improved RamaLama AI inference performance.

TL;DR: a 40x speedup, thanks to the enablement and optimization of Vulkan GPU acceleration via para-virtualization, to escape the VM isolation.

Podman macOS containers, running inside libkrun VMs, now perform at 75-80% of the native speed for AI inference \o/

And that’s not the end of it: as part of my ongoing work, I got a POC of llama.cpp running at 92% of the native speed! Stay tuned 😊

Multi-size performance evaluation

Sharing is caring: how to make the most of your GPUs? Part 2, Multi-Instance GPU

in RHOAI, PSAP, Work | blog-post

Today, my teammate Carlos Camacho published a blog post that continues the work I started on NVIDIA MIG GPUs:

In part one, about fractional GPUs, we talked about time slicing as “carpooling” for your GPU: getting more people (processes) into the same car (GPU) to use it more efficiently. In this second strategy, called Multi-Instance GPU (MIG) partitioning, imagine that for the same “carpooling” each person gets a numbered, sized seat, so everyone knows where to sit and whether they fit. This approach divides a GPU into isolated, static instances for concurrent usage by different applications.
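
As an illustration of what those “numbered and sized seats” look like from Kubernetes, here is a minimal sketch, assuming a cluster where the GPU Operator exposes MIG slices as extended resources such as `nvidia.com/mig-1g.5gb` (the exact resource name depends on the MIG profiles and strategy you configure, and the container image is only an example): a pod requests one fixed-size slice instead of a whole GPU.

```python
# Minimal sketch (assumed resource name and example image): request a
# fixed-size MIG "seat" instead of a whole GPU. The exact extended resource
# (e.g. nvidia.com/mig-1g.5gb) depends on the MIG profiles and strategy
# configured by the GPU Operator.
import yaml

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "mig-consumer"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "cuda-workload",
            "image": "nvcr.io/nvidia/cuda:12.4.1-base-ubi9",  # example image
            "command": ["nvidia-smi", "-L"],  # lists only the assigned MIG instance
            "resources": {"limits": {"nvidia.com/mig-1g.5gb": 1}},
        }],
    },
}

print(yaml.safe_dump(pod, sort_keys=False))  # pipe into `oc apply -f -`
```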

RHOAI MIG Sharing

Sharing is caring: how to make the most of your GPUs? Part 1, Time-sharing

in RHOAI, PSAP, Work | blog-post

Today, my teammate Carlos Camacho published a blog post that continues the work I started on the performance evaluation of time-sharing on NVIDIA GPUs:

GPU oversubscription is like “carpooling” for your GPU: you’re getting more people (processes) into the same car (GPU) to use it more efficiently. This approach gives you more throughput while keeping the overall system latency under specific service level agreements (SLAs) and reducing the time the resources sit unused. Of course, there can be some traffic jams (too many processes racing for resources), but with the right strategies and an understanding of your workloads, you can keep the system performing consistently well.
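
For reference, this kind of “carpooling” is typically enabled through the NVIDIA device plugin’s time-slicing configuration. The sketch below is a minimal example under assumptions (the ConfigMap name, namespace and config key are arbitrary, and the GPU Operator’s ClusterPolicy must be pointed at them): it generates a configuration advertising each physical GPU as 4 schedulable `nvidia.com/gpu` replicas.

```python
# Minimal sketch: generate a device-plugin time-slicing configuration that
# advertises each physical GPU as 4 oversubscribed nvidia.com/gpu replicas.
# The ConfigMap name, namespace and config key ("time-sliced-4") are arbitrary
# choices; the GPU Operator's ClusterPolicy must reference them.
import yaml

time_slicing_config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                {"name": "nvidia.com/gpu", "replicas": 4},
            ],
        },
    },
}

config_map = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "device-plugin-config", "namespace": "nvidia-gpu-operator"},
    "data": {"time-sliced-4": yaml.safe_dump(time_slicing_config)},
}

print(yaml.safe_dump(config_map, sort_keys=False))  # pipe into `oc apply -f -`
```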

RHOAI MIG Sharing

Continuous performance and scale validation of Red Hat OpenShift AI model-serving stack

in RHOAI, PSAP, Work | blog-post

Today, my blog post on the continuous performance and scale testing of the Red Hat OpenShift AI KServe model-serving stack was published!

Great work in collaboration with multiple people from the PSAP team and the RHOAI QE/dev teams.

It presents the results of three different flavors of performance and scale testing. Each flavor focuses on a particular aspect of the KServe model serving stack:

  • Single-model performance testing
    • Focuses on the performance of the model and its serving runtime to verify that it does not regress over the releases.
  • Multi-model performance and scale testing
    • Focuses on the performance of the model serving stack when running under heavy load but at low scale.
  • Many-model scale testing
    • Focuses on the scalability of the model deployment stack when running at large scale.

RHOAI_kserve_testing

A Guide to Scaling OpenShift Data Science to Hundreds of Users and Notebooks

in RHOAI, PSAP, Work | blog-post

Today we published a blog post presenting the results of my last 6 months of work: testing Red Hat OpenShift Data Science with 300 users requesting notebooks within 15 minutes.

It was a huge amount of work to get the scale testing infrastructure in place, but it was fruitful :) Along the way, we highlighted:

  • a network component dealing badly with its frequent reconfiguration (it was randomly throwing 404 errors). It got removed from the architecture.
  • a control plane overload leading to its collapse (and auto-recovery). The component spamming the Kubernetes API Server got refactored to avoid the compute-intensive, aggressive requests.
  • multiple race conditions in the Web UI, appearing under hard-to-pin-down conditions (including, but not limited to, the system load) and hence hard to observe and reproduce manually. We tracked down the root cause of these issues and got them fixed.

The blog post shows the final result, where OpenShift Data Science and the scale test infrastructure are running happily; there wasn’t enough space to describe the road we took to get there, pity 😅 🐞

And that’s just the beginning: now that the baseline is defined, we need to bring in more users, in less time, and optimize the time it takes to get a notebook … still a lot of work ahead!

RHODS notebook scale testing

A Guide to Functional and Performance Testing of the NVIDIA DGX A100

in Work | blog-post

My blog post on NVIDIA DGX A100 GPU testing was published yesterday on the OpenShift blog :)

In this blog post, we present how we performed the functional validation of the OpenShift GPU Operator running on the eight GPUs of the DGX™ A100. We describe the different MIG modes we tested, as well as the values of the node labels and Kubernetes resources exposed with these different settings. We also conduct a performance benchmark involving the eight GPUs running simultaneously, either all training a single AI/ML model or all performing independent computations.
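
To give an idea of what those node labels and Kubernetes resources look like in practice, here is a small sketch (assuming cluster access, a local kubeconfig and the `kubernetes` Python client, none of which are part of the blog post itself) that dumps the `nvidia.com/*` labels and allocatable resources published for each node:

```python
# Minimal sketch: list the nvidia.com/* node labels and allocatable resources
# published for each node (assumes a local kubeconfig and the `kubernetes`
# Python client installed).
from kubernetes import client, config

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    print(f"=== {node.metadata.name}")
    for key, value in sorted(node.metadata.labels.items()):
        if key.startswith("nvidia.com/"):
            print(f"  label     {key}={value}")
    for key, value in sorted(node.status.allocatable.items()):
        if key.startswith("nvidia.com/"):
            print(f"  resource  {key}: {value}")
```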

NVIDIA DGX Testing

Entitlement-Free Deployment of the NVIDIA GPU Operator on OpenShift

in PSAP | blog-post

Last night, we published the blog post presenting the work I led during the past few months, where we removed the need to deploy RHEL entitlement certificates to build and deploy the GPU driver of the NVIDIA GPU Operator. This requirement for a valid RHEL subscription key was a major burden for OpenShift GPU computing customers, as the key generation and deployment process couldn’t be properly automated.

This work was a great cooperation effort, as it required the enhancement of multiple parts of the overall software stack:

  • first at the team level with enhancements of the Node Feature Discovery (Eduardo Arango) and the OpenShift Driver Toolkit container image (David Gray and Zvonko Kaiser) + Ashish Kamra
  • then at the project level, with core OpenShift corner-case bugs discovered and solved, revolving around the Driver Toolkit dynamic imagestreams,
  • finally, at the inter-company open source cooperation level, with the NVIDIA Cloud Native team (Shiva Krishna Merla) reviewing the PRs, providing valuable feedback and spotting bugs in the middle of the different rewrites of the logic!

Link to the blog post

Timeline of the project milestones (in 2021):

  • May 27th..June 1st: the idea of using the Driver Toolkit for entitlement-free builds arises from a Slack discussion about solving disconnected-cluster deployment challenges. We quickly confirm that, with minor changes, the DTK provides everything required to build the NVIDIA driver.

  • July 30th..August 11th: working POC of the GPU Operator building without entitlement, without any modification of the operator, only a bit of configuration and some manually baked YAML files

  • August 26th..~November 15th: intensive work to add seamless upgrade support to the POC and get it all polished, tested and merged into the GPU Operator

  • December 2nd: GPU Operator v1.9.0 is released, with entitlement-free deployment enabled by default on OpenShift \o/

It’s funny to see how it took only a couple of days to get the first POC working, while the integration of the seamless upgrade support took two full months of work!

(Seamless upgrade support addresses the fact that, at a given time during a cluster upgrade, different nodes may run different versions of the OS. With one container image for all OS versions, no worry, the driver deployment works all the time; but with one container image per OS version, that’s another topic! This is covered in depth in the blog post.)
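
To make the one-image-per-OS-version problem more concrete, here is a hypothetical sketch: it groups the nodes by an OS-version label and counts how many driver images would be needed at that instant. The label name is an assumption (an NFD-style OS release label); the actual GPU Operator logic described in the blog post relies on the Driver Toolkit imagestream instead.

```python
# Hypothetical sketch of the one-image-per-OS-version problem: during an
# OpenShift upgrade, nodes may temporarily run different OS versions, so a
# single driver image cannot fit them all. The label name below is an
# assumption (NFD-style OS release label); the real GPU Operator logic relies
# on the Driver Toolkit imagestream instead.
from kubernetes import client, config

OS_LABEL = "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION"  # assumption

config.load_kube_config()
nodes_per_os_version = {}
for node in client.CoreV1Api().list_node().items:
    os_version = node.metadata.labels.get(OS_LABEL, "unknown")
    nodes_per_os_version.setdefault(os_version, []).append(node.metadata.name)

for os_version, nodes in nodes_per_os_version.items():
    # One driver build (and one driver image) is needed per OS version found.
    print(f"OS version {os_version}: {len(nodes)} node(s) -> needs a matching driver image")
```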