Today we published the results of my macOS container performance evaluation work, reviewing how recent enhancements to llama.cpp/Vulkan and libkrun improved #ramalama AI inference performance.

TL;DR: a 40x speedup, thanks to enabling and optimizing Vulkan GPU acceleration via para-virtualization to escape the VM isolation overhead.

Podman macOS containers, running inside libkrun VMs, now reach 75-80% of native speed for AI inference \o/
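For anyone who wants to try this on their own Mac, here is a minimal sketch of the setup, assuming a recent Podman with libkrun machine support and RamaLama installed; the model name is just a placeholder:

# Use libkrun (instead of the default applehv) as the Podman machine provider
export CONTAINERS_MACHINE_PROVIDER=libkrun
podman machine init
podman machine start

# Run a model with RamaLama; inside the libkrun VM, the container
# gets Vulkan GPU acceleration through the para-virtualized GPU device
ramalama run llama3.2

The key design point is that libkrun exposes the host GPU to the guest via para-virtualization, so llama.cpp's Vulkan backend inside the container can use it without full device passthrough.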

And that’s not the end of it: as part of my ongoing work, I have a proof of concept of llama.cpp running at 92% of native speed! Stay tuned 😊

Multi-size performance evaluation