The story of what it took to set up NVIDIA GPU drivers and time-slicing in our GPU EKS cluster
Estimated reading time: 9 minutes
I spent six months working on a work project that I thought would take two weeks. This essay narrates the adventure of implementing GPU time slicing on our EKS Kubernetes cluster. It began as a seemingly straightforward task – installing the gpu-operator – and morphed into a long-running exploration that uncovered bugs, helped us improve internal processes, and eventually led to an infrastructure upgrade.
My work was supposed to be simple. The ticket said ‘install the nvidia gpu operator and the time slicing add-on into our EKS GPU cluster’. I headed to the official NVIDIA website, and they had instructions for exactly my requirements. How fortuitous, I thought; it’d be done in a day or two.
Armed with clear instructions from the NVIDIA website, I started on the ticket with the seemingly simple task of installing the gpu-operator resource. I followed the provided Helm installation steps carefully, and promptly hit the first roadblock – a failed installation. The logs showed that the Helm chart contained 20 resources with Docker repository references pointing to NVIDIA’s registry. These references needed modification to work with Liberty Mutual’s Docker proxy. This seemingly minor adjustment was just the first obstacle on this uncharted course.
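For the curious, the overrides looked roughly like this. It’s an illustrative TypeScript sketch of the Helm values we would eventually feed through our cdk manifests – the proxy hostname is made up, the value keys vary by gpu-operator chart version, and the real chart had many more image references to patch than shown here:

```ts
// Hypothetical sketch: point the gpu-operator image repositories at an
// internal Docker proxy instead of nvcr.io. The hostname and keys are
// illustrative, not our real configuration.
const proxied = (upstream: string) => `docker-proxy.internal.example.com/${upstream}`;

const gpuOperatorValues = {
  operator: { repository: proxied('nvcr.io/nvidia') },
  validator: { repository: proxied('nvcr.io/nvidia/cloud-native') },
  driver: { repository: proxied('nvcr.io/nvidia') },
  toolkit: { repository: proxied('nvcr.io/nvidia/k8s') },
  devicePlugin: { repository: proxied('nvcr.io/nvidia') },
  dcgmExporter: { repository: proxied('nvcr.io/nvidia/k8s') },
};
```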
Undeterred, I looked further into the Helm chart. The binaries it attempted to install were UBI (Red Hat Universal Base Image) binaries – a format incompatible with our Amazon Linux-based nodes. After modifying the binaries for compatibility, I was surprised to find the issue persisted. Puzzled, I turned to the Kubernetes event logs and found the root cause: several resources required a minimum glibc version of 2.27, which exceeded the 2.26 shipped with our Amazon Linux nodes.
To work around the AMI limitation, I considered alternatives. I could do a custom glibc installation, but that would mean changing the links for every existing binary. I could also ‘shim’ the binary calls to the old library and forward them to a newer version. However, shimming a system library is a destructive change that we would need to maintain forever. No go. Instead, I chose a different upgrade: transitioning our base AMI to a newer Amazon Linux version with the needed glibc update.
Figuring out AMIs (Amazon Machine Images) was challenging. Liberty-approved options were limited, and I couldn’t just deploy a custom image. Serendipitously, my manager had recently been in a meeting with the company’s internal Containers team. I needed to contact the GDS CAAS (our internal container-as-a-service) team, and a Solution Engineer in our group connected me with one of their engineers. I found that they were using Ubuntu base images to bypass glibc-related issues they had previously encountered. I also discovered that Liberty IT’s official guidance was to migrate away from Amazon Linux to Ubuntu. How fortunate! It was decided: we would embrace Ubuntu to set up GPU time slicing in the cluster.
The Ubuntu Migration: Challenges and a Synchronized Upgrade
Migrating to Ubuntu base images presented a fresh set of issues. Our cloud-init scripts, customized for Amazon Linux, needed to be completely overhauled to manage GPU drivers for Ubuntu. The change of distribution also meant we’d need to reinstall and reconfigure everything at the operating system level: GPU drivers, CUDA libraries, and so forth. While working on this with CAAS engineers, we found that their team was actively developing Kubernetes 1.24 Ubuntu AMIs (we were a couple of versions behind). I saw an opportunity and proposed a synchronized upgrade of both the operating system and the Kubernetes version. No better time to bring change than by capitalizing on change already in motion! Considering the end-of-life status of our current Kubernetes version (v1.23), the upgrade offered better security and stability. I forged ahead with the Ubuntu image specifically designed for Kubernetes v1.24. At this point, I wasn’t thinking about the gpu-operator anymore: we needed an EKS version upgrade and an AMI migration to a different operating system, both of them pretty large changes.
The next step involved installing the requisite GPU drivers and libraries onto the Ubuntu image via the cloud-init script. However, obtaining clear, working instructions was time-consuming and frustrating. After several long, painful days of non-stop research and pleas for assistance in infrastructure channels on Slack, I made some progress. The drivers appeared to be installing. Finally! This victory, however, was short-lived.
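To give a flavour of what ended up in the node bootstrap, here’s a minimal sketch using the CDK user-data helpers. The package names and versions are assumptions rather than the exact ones we shipped, our real cloud-init did quite a bit more, and this assumes the NVIDIA apt repository has already been configured earlier in the script:

```ts
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Illustrative only: extra user data appended to the Ubuntu GPU node
// launch template. Package names/versions are assumptions, not the exact
// ones we used.
const gpuBootstrap = ec2.UserData.forLinux();
gpuBootstrap.addCommands(
  'apt-get update',
  // NVIDIA driver plus the container toolkit so containerd can see the GPU.
  'apt-get install -y nvidia-driver-535-server nvidia-container-toolkit',
  // Quick sanity check that the driver loaded; output lands in the cloud-init log.
  'nvidia-smi || echo "driver not ready yet"',
);
```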
Unearthing a Hidden Bug and Unforeseen Consequences
While installing the final driver, a critical issue appeared – the nodes were running out of disk space. Getting shell access into the nodes was itself a process, as we didn’t have the correct permissions to allow SSH; the missing piece turned out to be that Liberty’s SSM access wasn’t enabled by default within our base infrastructure code. Consulting our internal knowledge base (Confluence), I identified the necessary IAM policies, which I manually attached to the relevant node roles. Access finally granted, I examined the nodes, only to discover that the driver installation had consumed the entire 20GB of allocated root drive space. However, our infrastructure code supposedly configured a 100GB volume to be mounted on the root directory – so where was the remaining space disappearing to?
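As an aside, the access fix itself was small once I knew what to attach. Assuming the node role is exposed by our eks-cdk constructs (the names below are placeholders), it boils down to something like:

```ts
import * as iam from 'aws-cdk-lib/aws-iam';

// Placeholder for the node instance role created by our eks-cdk constructs.
declare const nodeRole: iam.IRole;

// Grant SSM Session Manager access so we can get a shell on the nodes
// without opening SSH. AmazonSSMManagedInstanceCore is the AWS managed
// policy that lets the SSM agent register the instance.
nodeRole.addManagedPolicy(
  iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore'),
);
```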
This situation ultimately led to the unearthing of a hidden bug within our infrastructure code. The 100GB volume was never actually being mounted to the intended mountpoint, effectively rendering it unusable for every cluster built with the code. This serendipitous discovery posed a new challenge: identifying the correct mountpoint. After scouring AWS documentation and enduring a few initial missteps, I located the appropriate mountpoint. However, a further hurdle arose – the primary volume’s mountpoint wasn’t a parametrized variable within the code. To rectify this, I raised a pull request against the internal eks-cdk library, introducing the missing parameter and supplying the correct value. Finally, the drivers installed successfully!
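The shape of the fix, heavily simplified (the construct and parameter names below are illustrative, not the real eks-cdk API): the 100GB volume only becomes the root filesystem if its device name matches the AMI’s root device, which differs between Amazon Linux (/dev/xvda) and Ubuntu (/dev/sda1) – hence the new parameter.

```ts
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as eks from 'aws-cdk-lib/aws-eks';

// Placeholder for the cluster created by our eks-cdk constructs.
declare const cluster: eks.Cluster;

// The root device name is now passed in per-AMI instead of being hardcoded,
// so Ubuntu-based node groups map the big volume onto the right device.
const rootDeviceName = '/dev/sda1'; // '/dev/xvda' on Amazon Linux

cluster.addAutoScalingGroupCapacity('GpuNodes', {
  instanceType: new ec2.InstanceType('g4dn.xlarge'),
  blockDevices: [
    {
      deviceName: rootDeviceName,
      volume: autoscaling.BlockDeviceVolume.ebs(100),
    },
  ],
});
```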
Network Disruption: A Port Conflict and a Community Rescue
While the cluster successfully formed and the nodes functioned, the cluster network gateways were in an error state. An investigation quickly made the issue clear: a DNS resolution conflict. CoreDNS, our cluster’s name resolver, was unable to start because systemd-resolved was already occupying port 53 on Ubuntu. This crippled intra-cluster communication.
Two potential solutions emerged: disabling the daemon, or altering CoreDNS to use a different port. Disabling the daemon seemed less complex, but I initially tried modifying the CoreDNS port. That took a lot of time to change and debug, and ultimately failed. Backing out, I tried the simpler solution – disabling the systemd-resolved service. With the new configuration, CoreDNS launched, restoring proper DNS resolution within the cluster.
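The eventual change was just a few more lines of node bootstrap. A rough sketch of the idea – our actual cloud-init differed, and 169.254.169.253 is the AWS-provided link-local address for the VPC DNS resolver:

```ts
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Rough sketch, not our exact cloud-init: stop systemd-resolved from
// holding port 53, then give the node a static resolv.conf pointing at the
// VPC DNS resolver so host-level lookups still work.
const dnsFix = ec2.UserData.forLinux();
dnsFix.addCommands(
  'systemctl disable --now systemd-resolved',
  'rm -f /etc/resolv.conf',
  'printf "nameserver 169.254.169.253\\n" > /etc/resolv.conf',
);
```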
Throughout this, I had successfully installed resources manually into the cluster’s sandbox account from my machine. However, replicating the process within the cdk Helm manifests created a new blocker. Our generated release names exceeded the 64-character limit imposed by Helm, resulting in installation failures. To work around this limitation, manual naming of the releases became necessary. While this fixed the immediate issue, the hardcoded names created a problem of their own – the deployments weren’t idempotent, meaning they couldn’t be reliably redeployed to the same state. This led to further refinements: once the cleanup work is complete, the release names will be dynamically generated from deployment IDs.
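In CDK terms, the workaround amounts to pinning the release name instead of letting the construct derive one from its (very long) construct path. A simplified sketch with illustrative names, not our exact manifest:

```ts
import * as eks from 'aws-cdk-lib/aws-eks';

// Placeholder for the cluster created by our eks-cdk constructs.
declare const cluster: eks.Cluster;

new eks.HelmChart(cluster.stack, 'GpuOperatorChart', {
  cluster,
  chart: 'gpu-operator',
  repository: 'https://helm.ngc.nvidia.com/nvidia',
  namespace: 'gpu-operator',
  // Hardcoded for now to stay under the release-name length limit;
  // the plan is to derive this from the deployment ID instead.
  release: 'gpu-operator',
});
```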
Our issues were not over yet. The Helm installation encountered a 503 Service Unavailable error while attempting to download binaries from the NVIDIA website. This was a significant challenge, as the cdk Helm manifests lacked the capability to handle authentication, a requirement for downloading the NVIDIA binaries.
At this point I was frustrated. I took an unconventional approach to solving the issue at hand. I destroyed the cluster and forced a complete reinstallation. Surprisingly, this proved successful – the installation completed without errors, and the desired resources, including the gpu-operator, were present with GPU time slicing enabled!
The saga wasn’t over. The base image we used wasn’t readily available in the newly migrated Solaria AWS accounts, and publishing the images would require considerable effort. A joint plan with the GDS team is being formulated to ensure image availability before merging all changes into the main codebase. A year after this initiative began, we are still not at prod deployment readiness. To be fair, we haven’t needed to deploy to prod immediately. [Edit: We went to prod in Mar 2024, and things have been a lot more stable as of late.]
This experience is a great example of how we work at Solaria and Liberty Mutual. We have a strong partnership across different teams. We found creative ways out of difficult situations and actively sought assistance from other teams. We discovered and fixed underlying issues, solving future problems for the entire company. All of that, just to install GPU time-slicing in our EKS cluster – which we later ended up not needing as our needs changed. But that’s a tale for a different day!