Well, doggies! When the LTX-2 video model landed last week, I went ahead and brewed up an experimental implementation of it in vllm-omni, the omni-modality serving layer for vLLM.
LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
I was going to try it out in ComfyUI, but I decided “hey maybe I can brew up an implementation based on the Lightricks/LTX-2 GitHub repo”, so I got to it.
Today for you – I’ve got a Docker image published with my work, up on quay.io/dosmith/vllm-omni:ltx2-experimental, along with instructions to fire it up, followed by some notes on the implementation.
Since I’ve been experimenting with vllm-omni recently, I wanted to see if I could brew up an implementation – I’ve at least got it producing video (no audio yet! I wanted to PoC the video path before tackling audio):
Also, I tried to do the “infamous Will Smith eating spaghetti” (wikipedia page) (holy smokes this has a wiki article!? I thought it was a /r/StableDiffusion meme!). Turns out… I suspect the model must’ve been scrubbed for copyright infringement type problems, and… It didn’t look like Will Smith at all (so I omitted posting it!).
These 1024x1024 10-second videos took me about 3 minutes end-to-end to run vLLM-omni and produce, on a single A100 GPU.
I’ve got a docker/Dockerfile.ltx2 there that shows how I built it as well.
How I wound up getting there
This work is primarily meant to be experimental. I wanted to see if I could “make it work” so that I could try out both LTX-2, and adding a new model implementation in vLLM-Omni. I’m not sure where it stands in terms of permanence.
I used both the LTX-2 GitHub repo and the ComfyUI implementation as references. Then I used ChatGPT (with GPT-5.2) to orchestrate prompts for Claude, with some back and forth with Codex to check the work. With those tools I leaned primarily on the LTX‑2 reference stack via the ltx_core + ltx_pipelines libraries, which provide the model components (transformer, VAE, schedulers, helpers) used by the vLLM‑Omni LTX‑2 pipeline in my branch.
In summary from Codex…
Introduced an LTX‑2 diffusion pipeline with model loading, distilled 8‑step denoising, video‑only (no audio), and post‑processing.
Added LTX‑2 configuration helpers to enforce dimension constraints (spatial dimensions in 32‑pixel multiples, frame counts of the form 8k+1) and provide defaults for fps, inference steps, and frame count.
Registered LTX‑2 in the diffusion registry and wired up a dedicated post‑process function for decoding video outputs.
Shipped a detailed offline example that shows how to run LTX‑2, extract pipeline outputs, and export an MP4 using PyAV.
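To give a flavor of that last bullet, here’s a minimal sketch of the PyAV export step – assuming the pipeline hands you a uint8 numpy array shaped (frames, height, width, 3); the function name and defaults here are illustrative, not the exact code from the example:

# Minimal sketch: encode a stack of RGB frames into an MP4 with PyAV.
# Assumes `frames` is a uint8 numpy array shaped (num_frames, height, width, 3).
import av
import numpy as np

def write_mp4(frames: np.ndarray, path: str = "ltx2_output.mp4", fps: int = 24) -> None:
    container = av.open(path, mode="w")
    stream = container.add_stream("libx264", rate=fps)
    stream.width = frames.shape[2]
    stream.height = frames.shape[1]
    stream.pix_fmt = "yuv420p"

    for frame_np in frames:
        frame = av.VideoFrame.from_ndarray(frame_np, format="rgb24")
        for packet in stream.encode(frame):
            container.mux(packet)

    # Flush any buffered packets, then close the container.
    for packet in stream.encode():
        container.mux(packet)
    container.close()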
Debugging: the VAE problem
The initial implementation seemed feasible, but… I ran into an issue with what I recognized was probably the VAE decoding. How’d I know that? Well, if you were fiddling around with StableDiffusion 1.X when it was first coming out, you’d see this often if you misconfigured something. Also – you’d see people post similar problems on /r/stablediffusion in those days.
Ready to generate images using vLLM-Omni for inference, but with ComfyUI as the front-end? Here we go!
Yup, that’s a steam punk skier generated in ComfyUI with vLLM in the loop!
I love myself a good diffusion model. Generative AI for imagery is what sparked my initial interest in the wave of gen AI, and I got really excited when I found out about vllm-omni – a framework that expands vLLM for “omni-modality model inference”, that is, other modalities, and combinations of them: text, images, and video as input and/or output. This also means that, for the first time, we can use diffusion models served from vLLM. Needless to say: I’m PUMPED. I’d been meaning to try the Qwen-Image model because it’s been seeing a lot of hype on /r/stablediffusion in the recent past.
I took the opportunity to build a PoC where I extended vllm-omni with an endpoint (modeled after the DALL-E OpenAI endpoint), so that I could then build a ComfyUI custom node. We’re going to have the vllm-omni portion deployed in docker, and you can deploy comfyUI wherever/however you’d like. I was originally inspired by the offline inference examples for vLLM-omni.
Why ComfyUI? Because it’s THE serious tool for putting together workflows for diffusion models. It’s got production-grade capabilities for composing workflows and extending the UI. That isn’t to say there aren’t other great front-ends for diffusion models – I love automatic1111 – but I want the ability to extend on this idea from Comfy, and it gives me a lot of freedom going forward to mix and match workflows in a composable fashion.
By the way – the text in this blog article today is formed from “hand-typed, artisanal words,” I wrote them myself because, while I’m using gen AI for lots of work, I really have a pride in my blog. Credit to my buddy @csibbitt for coining the phrase, “artisanal words”. As for the code and examples, they’re copy-pasta and iterated on, so at least the tech steps, partially by the clankers (which sorta feels like a swear word, but, I’m not sure!).
I can’t count the cups of coffee I had today (which means I can’t count over 3), but, I’m just drinking long coffees today because it’s like… part of the psychological pump up for putting together a PoC, gotta have a hot coffee at the ready.
What do I need to get rolling?
Basic gist is that we’re going to:
use Docker to run a build that I made of vLLM-omni, and
then we’re going to install my custom ComfyUI node, comfyui-vllm-omni.
Alright – the requirements are, well, fairly hefty. I’m using Qwen-Image: https://huggingface.co/Qwen/Qwen-Image – it might take more than what’s in your prosumer system.
GPU(s) powerful enough to run Qwen-Image and vLLM.
I think that’s basically everything you need. I ran vLLM-Omni on a B200 system (overpowered for this model, honestly, but fun), and ComfyUI on my local workstation with a 50-series card (maybe later I’ll integrate it into a multi-step workflow with local inference).
As for limitations: it’s 100% PoC mode. The API is “just a sketch” – I figured following the DALL-E OpenAI endpoint might make sense. Additionally, it kind of runs as a “stand-alone example” in the vllm-omni code, so it needs some more consideration. For now the goal is “just a text2image API”, but I’m also looking into image-to-image, even as I write (basically!). Also, I’ve only tested this with Qwen-Image, and I’m not sure what else works, or doesn’t…
Take note: I’m using CUDA_VISIBLE_DEVICES=0 because it’s a shared system and I don’t want to accidentally grab another user’s GPU. You can tailor or remove these settings as needed. Additionally, I was on the (gotta admit, weird) NVIDIA Ubuntu distro on this B200, so I was using Docker and not Podman. You can convert the commands easily.
From there, now you can generate an image using a REST endpoint, like this…
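I don’t have the exact request reproduced here, but since the endpoint is modeled after OpenAI’s DALL-E images API, something along these lines should be close – treat the path, port, and response shape as assumptions, and check the example in the repo for the real fields:

# Hedged sketch of hitting the PoC endpoint -- assumes it mirrors OpenAI's
# /v1/images/generations path and returns base64-encoded image data.
import base64
import requests

resp = requests.post(
    "http://localhost:8000/v1/images/generations",
    json={
        "prompt": "a steampunk skier carving through powder",
        "size": "1328x1328",  # one of the recommended Qwen-Image resolutions
        "n": 1,
    },
    timeout=600,
)
resp.raise_for_status()

# Assuming a DALL-E-style response shape: {"data": [{"b64_json": "..."}]}
image_b64 = resp.json()["data"][0]["b64_json"]
with open("output.png", "wb") as f:
    f.write(base64.b64decode(image_b64))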
There’s not a lot in terms of… progress or stats yet. But, we’ll get there.
That’s all you need on the vLLM-omni side.
Installing the custom node
Quite easy: what you’ll do is navigate to the custom_nodes folder in your ComfyUI deployment (in my case, from a git clone on my workstation), and then clone my ComfyUI-vLLM-Omni Text-to-Image node:
cd ./ComfyUI/custom_nodes
git clone git@github.com:dougbtv/comfyui-vllm-omni.git
From here, you might need to restart ComfyUI to see it, but then you can add the node.
In my case, I browse to the Node Library on the left nav (it looks like a chain link), then search for vllm, and you’ll find the node there.
Update the server_url with the correct address for where your vLLM-omni instance is running, and set other parameters as you wish.
Also of note: 1328x1328, 1664x928, 928x1664, 1472x1140, 1140x1472, 1584x1056, and 1056x1584 are the sizes expected to give the best results with Qwen-Image.
What’s next?! …I might see if I can get an image-to-image workflow, and I’m looking forward to engaging the community to find out what’s next for the project.
I’ve put together an improvement on a rad tool called canhazgpu. My buddy Russell built this tool to replace “ye ole spreadsheet” the team would use to reserve GPUs on a shared machine. It’s an awesome improvement for developers sharing GPUs – canhazgpu handles all the GPU allocation we used to track manually (on a, AHEM, spreadsheet if you didn’t catch that the first time). It’s primarily designed for a single host.
I started to wonder: What if we wound up extending this tool for a cluster of machines?
And you’d guess what my next thought would be: Let’s use Kubernetes Dynamic Resource Allocation (DRA) – well you’d make that guess if you’d read my article about k8s DRA for networking. But it’s partially because this is the kind of thing that DRA is designed for: Scheduling workloads based on limited and changing resources on the cluster, especially those that are hardware bound (Like GPUs [and, yes, sometimes network devices!]). Side note: DRA just graduated to GA in K8s 1.34, about a month ago now!
Today, some context and a demo, and next time, I’ll try to get this to a point where you can spin up your own cluster and allocate workloads on it dynamically. But today – grab your cappuccino and let’s walk through it quick. Well, at least it’s cappuccino o’clock here, and mine happens to have Stewart’s whole milk, as I was just in the Adirondacks this past weekend on some paddle day trips. The locals out there pronounce it STORTS. It’s a convenience store, but also known for their dairy, makes for fabled ice cream stops and was recently awarded the best milk in NY state (again).
Oh yeah – by the way if you didn’t catch it immediately canhazgpu is a reference to the I can has cheezburger meme, which is kind of a king of a meme empire.
tl;dr – Just give me the youtube video!
If you’d like to jump right in and see it in action… Here you go! Make sure to click the chapters if you’re impatient.
…There’s also an asciinema demo later in the article if you prefer that!
Quick history on canhazgpu: Folks in my team have a GPU server that they refer to as “Beaker” (they give Muppet nicknames to their machines), it’s got 8 high powered GPUs, and we recently rebuilt it, and I added automation to keep it under config management. But, Russell was a real hero to the developer experience when he put together canhazgpu – because previously people were looking up who had reserved GPUs in a spreadsheet.
So instead of looking up the GPUs in a spreadsheet, and logging your own use, you’d just type…
canhazgpu run --gpus 2 -- vllm serve my/model
canhazgpu figures out which GPUs are available on the machine, and then sets CUDA_VISIBLE_DEVICES, sets their reservation in a local data store, and then runs the command after the --, sorta like sudo echo make me a sandwich runs everything after sudo.
It’s really pretty too – it has a beautiful canhazgpu status to show who’s using what, uses a heartbeat pattern to figure out when reservations are done, and can even detect usage outside of canhazgpu if someone didn’t use it.
The team has adopted it pretty quickly, and Russell’s been sending off super vibe code agents to bug fix it, and of course with him puppeteering it like some kind of puppet virtuoso.
I’ve been working in a new role lately as an MLOps engineer, managing some of this infra (including Beaker), which includes provisioning machines for two classes of use: automation systems, and developer workstations. So I’ve been slanging Ansible playbooks.
Well… I got to thinking, what if we could scale this usage for canhazgpu to multiple machines?
On top of that, I’ve been collaborating with the upstream SIG-CI for vLLM. We manage the CI infra for vLLM and are working to improve how our CI works. I’ve been proposing some changes to how Docker images are built and run in order to get more efficiency, and a better developer experience for getting CI signal quickly.
I decided I wanted to combine both of these things:
The ease of use of canhazgpu
Speedy developer test runs of vLLM.
Kubernetes.
A DRA driver based on canhazgpu seemed to be just the right direction.
So I fired up Claude Code (and GitHub Copilot), drank a LOT of coffee, and brewed up this demo…
Why k8shazgpu?
While canhazgpu keeps sharing simple on one box, k8shazgpu scales that same philosophy across an entire cluster.
My goal was to make a Street Fighter-style combo of the ease of canhazgpu with the power of Kubernetes and dynamic resource allocation. The result is a DRA driver and controller that lets you request a GPU on any machine in your k8s cluster. Under the hood, k8shazgpu defines custom ResourceClaims, maintains a cache of images and git repos, and runs node agents that manage GPU allocation. When you ask for a GPU, the controller provisions a vLLM pod with the right image and mounts the cache, so your model starts serving almost immediately.
Key features include:
Declarative resource claims: you ask for a GPU and k8shazgpu uses Kubernetes’ Dynamic Resource Allocation to fulfill the request.
Intelligent caching: cache plans describe what images, repos and models should be pre‑fetched; node agents keep those caches warm.
vLLM integration: k8shazgpu automatically detects when you’re running from a vLLM checkout, figures out the merge‑base, packages your local changes and applies them in the container before launching.
Simple CLI: basic commands like status, vllm run, and cleanup make the tooling approachable.
To make a test run, you’d simply make a call like:
k8shazgpu vllm run -- vllm serve facebook/opt-125m --gpu-memory-utilization 0.8
Behind the scenes, all the magic is happening: the resource claims and the caching mechanisms (which were inspired by my work on improving CI).
Just like canhazgpu, developers using k8shazgpu don’t have to know anything about Kubernetes or DRA. You run the CLI and the controller does the dirty work. DRA itself is a fantastic building block, but it’s not a great end‑user interface. With k8shazgpu we hide that complexity behind a friendly CLI.
Let’s see it in action!
Here’s a demo, quick-like – it shows how you operate from a vLLM clone on a client workstation and make a request using k8shazgpu.
It’s a bit raw (the tmux panes render funny in asciinema), sorry, but it shows the basic workflow of requesting a GPU, running vLLM, and queueing up requests when you’re out of GPUs.
Improvements in flight!
Kubernetes 1.34 shipped in August 2025 with DRA APIs finally graduated to GA. My buddy Miguel Duarte has a pull request that bumps our code to the new v1 API. That means the controller is ready for clusters running the latest release. (For users, nothing would change – you still don’t need to touch any YAML, or even understand the DRA primitives! phew!)
Miguel and I worked on a talk about DRA for networking in Kubernetes at FOSDEM last year, and I’m looking forward to batting around some ideas about how this further pushes the conversation about high performance computing environments – both for AI/ML use cases as well as high performance networking use cases, which I’m more and more convinced are intertwined.
If you’re curious about the code, check out the branch k8shazgpu-demo-worthy. Miguel’s PR is a small but important patch that removes v1beta1 and updates our manifests for DRA’s GA.
Waxing philosophical on next steps.
Russell’s canhazgpu is totally “vibe coded”. So, I built off of that. It’s awesome how far Russell got with it, and he’s clearly great at using the tool and pushing great results out of it. I’ve been kind of “vibe coding” for a few years at this point. Even as an experiment, I built a CNI plugin called Surveyor for a FOSDEM talk by writing all the boilerplate with an LLM and then copy‑pasting giant hunks of code. These days the agentic technologies are getting better – Claude, GitHub Copilot, and even OpenAI’s Codex helped me prototype k8shazgpu. It still takes a human to glue everything together, but the tooling makes rapid prototyping fun.
I’m looking forward to seeing if I can garner some adoption of this tool among my team, and to seeing whether it, or concepts from it, can be integrated into upstream vLLM CI.
But, I have a feeling there’s a few other directions to move in…
I’d love to share further with people how they can learn DRA and learn to utilize it as it is now.
Right now k8shazgpu allocates X-number of GPUs on a single host – your workload is limited to a single machine. What if your workload is multi-host, like llm-d?
I’m rather curious about how that impacts networking across boxes.
Until next time, enjoy your cappuccino and happy vibe hacking!
I’ve been pretty excited to say hello to vllm, which is a library for LLM serving and inference. That, along with some recent upgrades in my lab setup, made me figure it was time to do it in the way that best fits my own brand, with these parameters:
Running on Fedora
Containerized (using Podman)
Using my new 50-series card in my new desktop
And… Distributed inference across my desktop and my GPU lab server.
It was fun to do, and satisfying! I learned a lot and I’m hopeful to be able to contribute something back to the community based on what I learned. Quick note that… This is totally what I’d call a “toy setup”. If I learned anything, it’s that vLLM has a design that’s meant for industrial-type applications, like, in the DCs. The kind of place where I used to drink burnt coffee at 2am in my telco switch tech days. But for this one – we can drink legit espresso at home and kick it laid back in some Thai fisherman pants. If you’re following along, I recommend a flat white for this one, I’ve been making mine using cream-top milk from Sweet Rowen Farmstead here in Vermont. It’s got a high milk fat content, so it basically doesn’t foam at all.
I’m inspired in part by this YT video about running an LLM across multiple computers. With the work I’ve been doing in exploring Dynamic Resource Allocation (DRA) in Kubernetes, this really speaks to me. And! I’ve got a good lab setup for it, too. I like this demo because it used prosumer hardware and it was well put together. However! It doesn’t use a containerized setup. I’m really interested in the distributed aspect, because I’m also curious how this might impact K8s networking in the future, too. Last but not least, I recently made an upgrade in my lab, so I’ve got 30, 40, and 50-series Nvidia cards. I had a hard time deciding if I should buy a 50-series, but I found a good deal on a 5060ti and was pretty excited it had a good chunk of VRAM – I kept wondering if it’d be worth it. Well, I have to say, with what I was able to learn using it: already worth it.
The tl;dr
Goal: Run vLLM in my home lab with distributed inference across two machines using a 50-series NVIDIA GPU (sm_120 arch).
The official image doesn’t work with 50-series GPUs yet due to an unsupported CUDA arch, so I built a custom Docker image using CUDA 12.9, compiled with sm_120 support.
I used Podman on Fedora, with nvidia-container-toolkit, and CDI setup for GPU access.
Distributed inference handled via Ray; using --net=host for simplicity in a lab environment.
Bonus: Played with structured outputs – super awesome.
Next steps: Might see what I can contribute back to vLLM, especially on the tooling and CI front.
…That being said, grab your flat white and let’s hit the terminal…
Getting my bearings
My initial instinct was to build an image from the ./docker/Dockerfile.nightly_torch in a clone of vllm, because of all the Dockerfiles it was the one I liked the read of the most – especially because I was worried that, with a 50-series, I might have to use something poppin’ fresh.
But, in the meanwhile I tried to use their official Docker image on my desktop workstation.
However, I had to setup CDI, first.
I ran into this on a docker run…
Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# This was unnecessary, and didn't work with dnf5
# sudo dnf config-manager --enable nvidia-container-toolkit-experimental
sudo dnf install -y nvidia-container-toolkit
# Don't forget to generate the CDI manifest!
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Unsure if I needed to, but, out of instinct I did...
sudo systemctl restart podman
If you’re not familiar with CDI, it’s the “container device interface”, and the CNCF container-device-interface repo is probably the place to go to learn more – it’s what allows us to allocate the GPU for our container to use.
Using the official vLLM Docker image
I picked up the docs for using vllm’s official docker image, which – in most cases – will probably work with most hardware, but as you’ll see, we needed a little bit more for the 50-series.
podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
Also, their example uses Mistral 7B v0.1, which I needed access for – so make sure your Hugging Face token is there; apparently you have to agree to share some info with the authors to download it. I did that on Hugging Face, got it to run, but was met with…
NVIDIA GeForce RTX 5060 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
Getting an sm_120 compatible build
By sm_120, I’m referring to the compute capability of NVIDIA Blackwell architecture (that is, the RTX 50-series GPUs).
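If you want a quick sanity check of what your card reports, and which architectures your PyTorch build was actually compiled for, something like this works – just a sketch to run inside whatever container or venv you’re testing:

# Quick sanity check: what compute capability does the GPU report, and which
# architectures was this PyTorch build compiled for?
import torch

major, minor = torch.cuda.get_device_capability(0)  # e.g. (12, 0) for sm_120 / Blackwell
print(f"GPU 0 compute capability: sm_{major}{minor}")
print("Compiled arch list:", torch.cuda.get_arch_list())  # should include 'sm_120' for a 50-series build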
Now it’s time to roll up our sleeves and make a full build. In the end, I used this modified Dockerfile – more on my process as you read through. I did bump CUDA to 12.9 from 12.8, and I also had to make sure I specified the new sm_120 arch.
I got a hint from my bud Russell about VLLM_USE_PRECOMPILED=1, which uses vLLM’s prebuilt wheels. I get a build after about… 30-45ish minutes? Nice way to toast up my CPU and let my AIO cooler purr. During the build I saw a lot of build instructions with sm_* stuff but not sm_120, which had me concerned, and I didn’t get a copy/paste because it exceeded my terminal buffer, oh well! I wound up making a number of attempts, and was getting errors when I tried to podman run it, like:
# vllm serve --model mistralai/Mistral-7B-v0.1
[...snip...]
ModuleNotFoundError: No module named 'vllm._C'
During the build – I had a couple of unlucky crashes (which I didn’t dig deeply into to diagnose, new F42 and all that), but! I got a build. Of course I forgot to time it; it happened overnight. I also need to figure out the caching, because I don’t think I’m caching, and the Dockerfile sure looks like it has a lot of hooks for caching – it must be SUPER important with these huge builds.
I posted my originally working Dockerfile as a GitHub gist and a diff of it (used against commit 6115b115826040ad1f49b69a8b4fdd59f0df5113, but! Be warned, I think at some point in the Dockerfile build process it pulls a fresh head from a git remote).
Some weird hiccough prevented me from logging into quay, so I pushed… (Update! I have it on quay now!)
Getting vLLM running on a single host and giving it a test.
And I run it like this:
podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
--env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--ipc=host \
localhost/dougbtv/vllm:latest \
mistralai/Mistral-7B-v0.1 \
--gpu-memory-utilization 0.5
I’m getting a lot further, but I see:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 15.47 GiB of which 68.00 MiB is free. Including non-PyTorch memory, this process has 12.51 GiB memory in use.
I tried to look around quite a bit… But I realized something VERY important. Here’s the thing:
vLLM isn’t ollama. vLLM is meant for high-throughput inference serving in data centers, not the scrappy-style local inference that I usually do. So, something like ollama uses some quantized GGUF models and CPU offloading to work in constrained environments, but, this is designed differently.
I previously assumed I could run a model that size on this 5060ti with 16 gigs of VRAM, but, not this way. So I decided to run a tiny model instead: facebook/opt-125m.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"prompt": "What is the capital of Vermont?",
"max_tokens": 20
}'

{"id":"cmpl-b062e3b04f48487590b36e21cfba0839","object":"text_completion","created":1746708569,"model":"facebook/opt-125m","choices":[{"index":0,"text":" Edit: Or, I heard Vermont is a home state\nI don't think you put the","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20,"prompt_tokens_details":null}}
Maybe just “Montpelier” would be an appropriate answer.
But, like The Bad News Bears, sometimes I just gotta try it, put the heart into it. And that part worked! I got the 50-series working with vLLM which was half the goal.
I tried loading a quantized model, took me a few tries to get the params right…
podman run --gpus all \
-v /home/doug/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
--env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--ipc=host \
localhost/dougbtv/vllm:latest \
TheBloke/Mistral-7B-Instruct-v0.1-AWQ \
--quantization awq \
--dtype float16 \
--gpu-memory-utilization 0.95
Which also appears to work…
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
"prompt": "What is the capital of Vermont?",
"max_tokens": 20
}'

{"id":"cmpl-4c7d34a0a8a547d4b4fffbaa2b64fdfa","object":"text_completion","created":1746721156,"model":"TheBloke/Mistral-7B-Instruct-v0.1-AWQ","choices":[{"index":0,"text":"\nMontpelier","logprobs":null,"finish_reason":"stop","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":9,"total_tokens":14,"completion_tokens":5,"prompt_tokens_details":null}}
Great. Nailed it this time.
And give it a run on the 3090 machine…
Now on the 3090 machine, which I call stimsonmt. It’s named for a nearby mountain to my house, it sits next to another box bonemt which is on the other side of the valley, so, that way I know which is which positionally, as they’re pets! Fun fact, Whereabouts IPAM CNI’s topographic logo is based on the topo for Stimson Mt!
Going to try with the example mistral model from the docs, again…
podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
--env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--security-opt=label=disable \
--device=nvidia.com/gpu=all \
--ipc=host \
docker.io/dougbtv/vllm \
mistralai/Mistral-7B-v0.1 \
--gpu-memory-utilization 0.95
I had to add --device=nvidia.com/gpu=all, certainly because of how I had set up podman on this machine (I went and scooped the podman run that I’m using for kohya_ss, so I had a good reference!). I’ll try to normalize them later.
Appears to run… let’s try it…
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-v0.1",
"prompt": "What is the capital of Vermont?",
"max_tokens": 20
}'

{"id":"cmpl-fa383b641f79470495d3e9a0fa537c25","object":"text_completion","created":1746725937,"model":"mistralai/Mistral-7B-v0.1","choices":[{"index":0,"text":"\n\nThe capital of Vermont is Montpelier, which is also the state's smallest","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":9,"total_tokens":29,"completion_tokens":20,"prompt_tokens_details":null}}
Took 4 tries before it got the right answer, but it did work! That used basically the whole GPU VRAM, too.
Structured output is a way to kind of mold the LLM to give you output in a specific structure – like, say you want the return to be all JSON, or limited to a specific set of values; heck, you can even use a regex!
I brewed up this python script (as the curl was getting out of hand) to determine whether the sentiment is “rad” or “bogus” depending on what you feed in, so I also tried that…
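Here’s roughly what the script boils down to – a minimal sketch, not my exact test.py, assuming vLLM’s guided_choice parameter on the completions endpoint and whichever model you’ve got loaded:

# test.py -- a minimal sketch of the rad/bogus classifier.
# Uses vLLM's guided_choice parameter to constrain the completion to one of two values.
import sys
import requests

statement = sys.argv[1]

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # whichever model you're serving
        "prompt": f"Classify the sentiment of this statement as rad or bogus: {statement}\nSentiment:",
        "max_tokens": 5,
        "temperature": 0,
        # vLLM's guided decoding: only allow one of these outputs
        "guided_choice": ["rad", "bogus"],
    },
    timeout=60,
)
resp.raise_for_status()

print(resp.json()["choices"][0]["text"].strip())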
$ python test.py "the powder is wicked in the trees dude"
rad
$ python test.py "parcheesy is the best game"
bogus
Getting it running across nodes.
Now that I had a baseline, I went to see if I can run it on two nodes…
After refreshing myself on a few moments in the YT video, I realized that there’s one thing I’m probably going to have to account for, for now: you have to specify an interface in some of the parameters.
This makes me feel a few ways:
It’s so network infra kinda stuff: Specifying the interface.
This has been a decade long battle in container land.
…I know I’ve got this one on lock, one way or another *puts shades on*
We’re going old school, where it all began for me with containers and networking…
--net=host
For this scenario, it’s probably the most efficient, and security isn’t my biggest concern. As per usual, I have a long story about how I used to use this, and now I don’t, and I feel like I was instrumental in making it so that people don’t have to do that anymore, but for today? It’s just the thing. Maybe we can return to this to handle it another way, but it’s also kind of its own project.
Ray is a unified framework for scaling AI and Python applications
Sounds just like what we want to do, awesome.
Ok that’s a good start… ray is already built in our image, phew!
root@478a2ec83613:/vllm-workspace# ray --version
2025-05-09 05:04:41,498 - INFO - Note: NumExpr detected 24 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-05-09 05:04:41,498 - INFO - NumExpr defaulting to 16 threads.
ray, version 2.46.0
Unfortunately, iproute2 isn’t installed, so we have to install it with apt install iproute2. I wasn’t quite expecting this, but I can see the host interface inside the container.
root@e60d5253152e:/vllm-workspace# ip a | grep -P "^\d|192"
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
inet 192.168.50.198/24 brd 192.168.50.255 scope global noprefixroute eno1
And I can inspect it with:
$ podman inspect --format '' condescending_mendel
pasta
Apparently! I’m not familiar with the ins-and-outs of pasta networking from podman. These docs provide a pretty good overview, it’s named that way for “Point-to-point Adaptive Shared Transport Architecture”. I’m so used to CNI. Sooooo! I’m not sure of the risks, so, we’re sticking with host networking.
Here’s a run with host networking that I’m using for checking the situation out…
podman run -it --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
--env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--net=host \
--ipc=host \
--entrypoint=/bin/bash \
localhost/dougbtv/vllm:latest
I’m going to have my stimsonmt host act as the “head”, the main node, the only reason I chose it is because it has more VRAM, having a 3090. My other host (named: yoda, long story on that one) will be a worker.
So I run the start command, and I can see it has the host IP address…
root@stimsonmt:/vllm-workspace# ray start --head --port=6379 --memory 24000000000
[...snip...]
Local node IP: 192.168.50.199
Let’s try to join a node… I’m just giving it 12 out of 16 gigs because yoda is a workstation, so my GUI is hogging VRAM.
ray start --memory 12000000000 --address='192.168.50.199:6379'
Cause for a w00t, I can see the combination with ray status on the head (or the node, too)
root@stimsonmt:/vllm-workspace# ray status
2025-05-09 05:39:18,663 - INFO - NumExpr defaulting to 16 threads.
======== Autoscaler status: 2025-05-09 05:39:16.967610 ========
Node status
---------------------------------------------------------------
Active:
1 node_a9d14d6a94d5e77fbdafdb8802dada067a37417f2f7ad98426aa3eb7
1 node_4122ac6d486c994414de5ed9704c329002b5ec596fb38dd0806fe1fd
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/40.0 CPU
0.0/2.0 GPU
0B/33.53GiB memory
0B/46.26GiB object_store_memory
Total Constraints:
(no request_resources() constraints)
Total Demands:
(no resource demands)
I had to fiddle with it, because that’s a 2 node / 4 GPU setup, and it took me a few tries, but, I wound up getting this warning which I really like in my own way:
WARNING 05-09 06:03:18 [ray_utils.py:199] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node d97d078c6e6c301dafb330684dd1aede943dcbe95dabc32bfc3fa5c7. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
It mentions Infiniband – which is the kind of thing I see get a lot of use in HPC networking scenarios, awesome. I don’t have that setup, that’s yet another blog article! But, after trying a number of combos, and failing, I think that warning is actually good news in our case.
For my test, I wound up choosing the same model Bijan (the YT creator) used – he was aiming for something small and simple, and that works for me…
Now, we just run the serve command on the head, and let it fire up….
We can see that we get a process loaded on the node, yoda.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3.5-mini-instruct",
"prompt": "Name two former or current US senators from Vermont.",
"max_tokens": 150,
"temperature": 0.7
}'
This calls for a: wewwwwwwwwwwwwwwt! There’s a successful completion!
doug@stimsonmt:/tmp/testgrammar$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3.5-mini-instruct",
"prompt": "Name two former or current US senators from Vermont.",
"max_tokens": 150,
"temperature": 0.7
}'
{"id":"cmpl-8606a78a9d6b4a0ba7d8290c0d8e6893","object":"text_completion","created":1746797459,"model":"microsoft/Phi-3.5-mini-instruct","choices":[{"index":0,"text":"\n\nVermont, a state known for its picturesque landscapes and progressive political stance, has had several distinguished individuals serve in the United States Senate. Here are two former and two current US senators from Vermont:\n\nFormer Senators:\n\n1. Patrick Leahy: Serving since 1975, Patrick Leahy is a Democrat who has been one of the longest-serving senators in the history of the United States. He was re-elected multiple times and has had a significant impact on national policy and legislation.\n\n2. Jim Jeffords: He served from 1989 to 2007 as a Republican senator","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":12,"total_tokens":162,"completion_tokens":150,"prompt_tokens_details":null}}
Why it somehow mentioned Jeffords before Bernie is…. An absolute mystery to me, that seems so unlikely! But… Alas, it’s also correct. And… Hurray!
And guess what? It also can use the structured output!
doug@stimsonmt:/tmp/testgrammar$ python test.py "shredding a quick uphill lap at Bolton after work"
rad
I’m also trying to figure out how to contribute some of this learning back to the codebase! Thankfully Russell also pointed me at the vllm ci infra project. And I even joined the vLLM slack!
I had been putting off upgrading Fedora because… Wayland is becoming unavoidable. And I’ve been avoiding it for a couple releases now. And Fedora 42 is out. And since it’s the answer to life, the universe and everything, I can’t deny it. Wayland is the future.
Primarily because I’ve been an i3wm user – it’s a “twm” (tiling window manager, or that’s what I call it) – and I’ve gotten mighty cozy in it, but eventually, you have to move on! So, alas, I’m going back to GNOME, and getting my life set up with Wayland.
My main desktop workstation is, and has been, Fedora for… basically as long as I’ve been a professional.
Yep! Fedora Core 3! Back in my day! Err, before Fedora 7, we called it Fedora Core.
People can remember where they were during major historical events, and I can remember where I was when someone told me that I probably wouldn’t be using Red Hat Linux anymore, but instead… Fedora. I sandbagged for a release, I think, and eventually got on board at Fedora Core 2. Unfortunately I only have photographic proof of FC3! Apparently I do that and sandbag a release or two, still doing it!
Here’s the thing – I have a bunch of great machines in my lab: a media workstation (for graphics, both traditional and AI/ML, as well as music production), a GPU lab machine (for training runs and long-running jobs), a virtualization host, and… also a primary developer’s workstation. That last one is kind of the vehicle to use all the better machines, so it hasn’t gotten love in a while. But I knew I had to upgrade, and… that machine was getting a little draggy, so, time to upgrade.
Getting my life back together as an i3wm user.
My good bud Leif mentioned this article from sudoscience about turning gnome into a tiling window manager. He got me into i3wm, so, thanks for throwing me a line.
I settled on trying a few things that I wound up keeping and liking, so far…
I use Barrier so that I can use my work laptop, my media workstation, and my developer’s workstation all at the same time. But… Guess what? Apparently the maintainers have started their own fork, input-leap. That’s probably good news, but… I did struggle a bit. Wayland has some nice security considerations, so, you can’t start it without manually accepting the remote connection, which is… Sort of a bummer, so I have to KVM switch (which is really just a USB switch for my inputs) in order to wake the machine and accept the connection. The media workstation needs to be primary – especially because it uses a graphical pen tablet.
I also had to hack together some stuff to SSH my clipboard (yuuuuck) to be able to use it conveniently between boxen, because, that’s also a security issue. So, remote control, not as good, but… I’m going to live with it and work around it for now.
Pop Shell is treating me fairly well, at least I’m able to do most stuff that I could do before, just some clunking around, still getting all my chords straightened out.
Getting my Nvidia GPU to fire up.
I have other, better GPUs in my lab, but I needed something in this machine. I wound up opting for a 5060ti because I could get a good deal, and these things have a bunch of VRAM these days, so I was kind of pumped to have that much VRAM around – at least it makes some things possible, even if it’s got way fewer CUDA cores than my other GPUs in the lab.
However, I’m kind of in the early adopter phase, both a 50-series and an O/S that was just released 2 weeks ago.
Thank heavens for if-not-true-then-false (inttf)! I have used these articles for Fedora setups for a while now, that author deserves an award.
These saved my bacon. I tried installing from fusion repos, and I wasn’t having good luck. I did figure out that my card does require a beta driver, so… That might be related.
Mostly, after I got going with the inttf articles, I was making progress. I did have to reboot umpteen times while I really screwed it up, and it also tested my grub2 know-how (I wasn’t getting a clean boot when I first plugged in the video card, naturally). Unfortunately, I’m still getting a slow boot, and I lost graphical boot, despite inttf’s awesome Plymouth setup. I think I can remember losing graphical boot… many… many times in my Fedora career. I tried a few times, and I’ll try again as the drivers get updates and whatnot, but for now, I’m living with it.