A Guide to Scaling OpenShift Data Science to Hundreds of Users and Notebooks
2022-12-13 in RHOAI, PSAP, WorkToday we published a blog post presenting the results of my last 6 months of work: testing Red Hat OpenShift Data Science with 300 users requesting notebooks within 15 minutes.
It was a huge work getting the scale testing infrastructure in place, but it was fruitful :) Along the way, we highlighted: - a network component dealing badly with its frequent reconfiguration (it was randomly throwing 404 error). It got removed from the architecture. - a control plane overload leading to its collapsing (and auto-recovery). The component spamming the Kubernetes API Server got refactored to avoid the compute-intensive aggressive requests. - multiple race conditions in the Web UI, appearing under random conditions (including the system load, but not only) and hence hard to observe and reproduce manually. We tracked down the route cause of the issues and got them fixed.
The blog post shows the final result, where OpenShift Data Science and scale test infrastructure are running happily, there wasn’t enough space for describing the route to reach it, pity 😅 🐞
And that’s just the beginning, now that the baseline is defined, we need to bring in more users, in less time, and optimize the time for getting a notebook … still a lot of work ahead!