Well, doggies! When the LTX-2 video model landed last week, I went ahead and brewed up an experimental implementation of it in vllm-omni, the omni-modality serving layer for vLLM.
LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
I was going to try it out in ComfyUI, but I decided “hey maybe I can brew up an implementation based on the Lightricks/LTX-2 GitHub repo”, so I got to it.
Today for you – I’ve got a Docker image published with my work, up on quay.io/dosmith/vllm-omni:ltx2-experimental, along with instructions to fire it up, followed by some notes on the implementation.
Since I’ve been experimenting with vllm-omni recently, I wanted to see if I could brew up an implementation – I’ve at least got it producing video (no audio yet! I wanted to PoC the video path before tackling audio):
Also, I tried to do the “infamous Will Smith eating spaghetti” (wikipedia page) (holy smokes this has a wiki article!? I thought it was a /r/StableDiffusion meme!). Turns out… I suspect the model must’ve been scrubbed for copyright infringement type problems, and… It didn’t look like Will Smith at all (so I omitted posting it!).
These 1024x1024 10-second videos took me about 3 minutes end-to-end to run vLLM-omni and produce, on a single A100 GPU.
I’ve got a docker/Dockerfile.ltx2 there that shows how I built it as well.
How I wound up getting there
This work is primarily meant to be experimental. I wanted to see if I could “make it work” so that I could try out both LTX-2, and adding a new model implementation in vLLM-Omni. I’m not sure where it stands in terms of permanence.
I used both the LTX-2 GitHub repo and the ComfyUI implementation as references. Then I used ChatGPT (with GPT-5.2) to orchestrate prompts for Claude, with some back and forth with Codex to check the work. With those tools I leaned primarily on the LTX‑2 reference stack via the ltx_core + ltx_pipelines libraries, which provide the model components (transformer, VAE, schedulers, helpers) used by the vLLM‑Omni LTX‑2 pipeline in my branch.
In summary from Codex…
Introduced an LTX‑2 diffusion pipeline with model loading, distilled 8‑step denoising, video‑only (no audio), and post‑processing.
Added LTX‑2 configuration helpers to enforce dimension constraints (spatial dimensions in 32‑pixel multiples, frame counts of the form 8k+1) and provide defaults for fps, inference steps, and frame count.
Registered LTX‑2 in the diffusion registry and wired up a dedicated post‑process function for decoding video outputs.
Shipped a detailed offline example that shows how to run LTX‑2, extract pipeline outputs, and export an MP4 using PyAV.
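To give a flavor of that last bullet, here’s a minimal sketch of the PyAV export step – assuming the pipeline hands you a uint8 numpy array shaped (frames, height, width, 3); the function name and defaults here are illustrative, not the exact code from the example:

# Minimal sketch: encode a stack of RGB frames into an MP4 with PyAV.
# Assumes `frames` is a uint8 numpy array shaped (num_frames, height, width, 3).
import av
import numpy as np

def write_mp4(frames: np.ndarray, path: str = "ltx2_output.mp4", fps: int = 24) -> None:
    container = av.open(path, mode="w")
    stream = container.add_stream("libx264", rate=fps)
    stream.width = frames.shape[2]
    stream.height = frames.shape[1]
    stream.pix_fmt = "yuv420p"

    for frame_np in frames:
        frame = av.VideoFrame.from_ndarray(frame_np, format="rgb24")
        for packet in stream.encode(frame):
            container.mux(packet)

    # Flush any buffered packets, then close the container.
    for packet in stream.encode():
        container.mux(packet)
    container.close()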
Debugging: the VAE problem
The initial implementation seemed feasible, but… I ran into an issue with what I recognized was probably the VAE decoding. How’d I know that? Well, if you were fiddling around with StableDiffusion 1.X when it was first coming out, you’d see this often if you misconfigured something. Also – you’d see people post similar problems on /r/stablediffusion in those days.
Ready to generate images using vLLM-Omni for inference, but with ComfyUI as the front-end? Here we go!
Yup, that’s a steam punk skier generated in ComfyUI with vLLM in the loop!
I love myself a good diffusion model. Generative AI for imagery is what sparked my initial interest in the wave of gen AI, and I got really excited when I found out about vllm-omni – a framework that expands vLLM for “omni-modality model inference”, that is, other modalities, and combinations of them: text, images, and video as input and/or output. This also means that, for the first time, we can use diffusion models served from vLLM. Needless to say: I’m PUMPED. I’d been meaning to try the Qwen-Image model because it’s been seeing a lot of hype on /r/stablediffusion in the recent past.
I took the opportunity to build a PoC where I extended vllm-omni with an endpoint (modeled after the DALL-E OpenAI endpoint), so that I could then build a ComfyUI custom node. We’re going to have the vllm-omni portion deployed in docker, and you can deploy comfyUI wherever/however you’d like. I was originally inspired by the offline inference examples for vLLM-omni.
Why ComfyUI? Because it’s THE serious tool for putting together workflows for diffusion models. It’s got production-grade capabilities for composing workflows and extending the UI. That isn’t to say there aren’t other great front-ends for diffusion models – I love automatic1111 – but I want the ability to extend on this idea from Comfy, and it gives me a lot of freedom going forward to mix and match workflows in a composable fashion.
By the way – the text in this blog article today is formed from “hand-typed, artisanal words,” I wrote them myself because, while I’m using gen AI for lots of work, I really have a pride in my blog. Credit to my buddy @csibbitt for coining the phrase, “artisanal words”. As for the code and examples, they’re copy-pasta and iterated on, so at least the tech steps, partially by the clankers (which sorta feels like a swear word, but, I’m not sure!).
I can’t count the cups of coffee I had today (which means I can’t count over 3), but, I’m just drinking long coffees today because it’s like… part of the psychological pump up for putting together a PoC, gotta have a hot coffee at the ready.
What do I need to get rolling?
Basic gist is that we’re going to:
use Docker to run a build that I made of vLLM-omni, and
then we’re going to install my custom ComfyUI node, comfyui-vllm-omni.
Alright – the requirements are, well, fairly hefty. I’m using Qwen-Image: https://huggingface.co/Qwen/Qwen-Image – it might take more than what’s in your prosumer system.
GPU(s) powerful enough to run Qwen-Image and vLLM.
I think that’s basically everything you need. I ran vLLM-Omni on a B200 system (overpowered for this model, honestly, but fun), and ComfyUI on my local workstation with a 50-series card (maybe later I’ll integrate it into a multi-step workflow with local inference).
As for limitations: it’s 100% PoC mode. The API is “just a sketch” – I figured following the DALL-E OpenAI endpoint might make sense. Additionally, it kind of runs as a “stand-alone example” in the vllm-omni code, so it needs some more consideration. For now the goal is “just a text2image API”, but I’m also looking into image-to-image, even as I write (basically!). Also, I’ve only tested this with Qwen-Image, and I’m not sure what else works, or doesn’t…
Take note: I’m using CUDA_VISIBLE_DEVICES=0 because it’s a shared system and I don’t want to accidentally grab another user’s GPU. You can tailor or remove these settings as needed. Additionally, I was on the (gotta admit, weird) NVIDIA Ubuntu distro on this B200, so I was using Docker and not Podman. You can convert the commands easily.
From there, now you can generate an image using a REST endpoint, like this…
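I don’t have the exact request reproduced here, but since the endpoint is modeled after OpenAI’s DALL-E images API, something along these lines should be close – treat the path, port, and response shape as assumptions, and check the example in the repo for the real fields:

# Hedged sketch of hitting the PoC endpoint -- assumes it mirrors OpenAI's
# /v1/images/generations path and returns base64-encoded image data.
import base64
import requests

resp = requests.post(
    "http://localhost:8000/v1/images/generations",
    json={
        "prompt": "a steampunk skier carving through powder",
        "size": "1328x1328",  # one of the recommended Qwen-Image resolutions
        "n": 1,
    },
    timeout=600,
)
resp.raise_for_status()

# Assuming a DALL-E-style response shape: {"data": [{"b64_json": "..."}]}
image_b64 = resp.json()["data"][0]["b64_json"]
with open("output.png", "wb") as f:
    f.write(base64.b64decode(image_b64))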
There’s not a lot in terms of… progress or stats yet. But, we’ll get there.
That’s all you need on the vLLM-omni side.
Installing the custom node
Quite easy: what you’ll do is navigate to the custom_nodes folder in your ComfyUI deployment (in my case, from a git clone on my workstation), and then clone my ComfyUI-vLLM-Omni Text-to-Image node:
cd ./ComfyUI/custom_nodes
git clone git@github.com:dougbtv/comfyui-vllm-omni.git
From here, you might need to restart ComfyUI to see it, but then you can add the node.
In my case, I browse to the Node Library on the left nav (it looks like a chain link), then search for vllm, and you’ll find the node there.
Update the server_url with the correct address for where your vLLM-omni instance is running, and set other parameters as you wish.
Also of note: 1328x1328, 1664x928, 928x1664, 1472x1140, 1140x1472, 1584x1056, and 1056x1584 are the sizes expected to give the best results with Qwen-Image.
What’s next?! …I might see if I can get an image-to-image workflow, and I’m looking forward to engaging the community to find out what’s next for the project.
I’ve put together an improvement on a rad tool called canhazgpu. My buddy Russell built this tool to replace “ye ole spreadsheet” the team would use to reserve GPUs on a shared machine. It’s an awesome improvement for developers sharing GPUs – canhazgpu handles all the GPU allocation we used to track manually (on a, AHEM, spreadsheet if you didn’t catch that the first time). It’s primarily designed for a single host.
I started to wonder: What if we wound up extending this tool for a cluster of machines?
And you’d guess what my next thought would be: Let’s use Kubernetes Dynamic Resource Allocation (DRA) – well you’d make that guess if you’d read my article about k8s DRA for networking. But it’s partially because this is the kind of thing that DRA is designed for: Scheduling workloads based on limited and changing resources on the cluster, especially those that are hardware bound (Like GPUs [and, yes, sometimes network devices!]). Side note: DRA just graduated to GA in K8s 1.34, about a month ago now!
Today, some context and a demo, and next time, I’ll try to get this to a point where you can spin up your own cluster and allocate workloads on it dynamically. But today – grab your cappuccino and let’s walk through it quick. Well, at least it’s cappuccino o’clock here, and mine happens to have Stewart’s whole milk, as I was just in the Adirondacks this past weekend on some paddle day trips. The locals out there pronounce it STORTS. It’s a convenience store, but also known for their dairy, makes for fabled ice cream stops and was recently awarded the best milk in NY state (again).
Oh yeah – by the way if you didn’t catch it immediately canhazgpu is a reference to the I can has cheezburger meme, which is kind of a king of a meme empire.
tl;dr – Just give me the youtube video!
If you’d like to jump right in and see it in action… Here you go! Make sure to click the chapters if you’re impatient.
…There’s also an asciinema demo later in the article if you prefer that!
Quick history on canhazgpu: Folks in my team have a GPU server that they refer to as “Beaker” (they give Muppet nicknames to their machines), it’s got 8 high powered GPUs, and we recently rebuilt it, and I added automation to keep it under config management. But, Russell was a real hero to the developer experience when he put together canhazgpu – because previously people were looking up who had reserved GPUs in a spreadsheet.
So instead of looking up the GPUs in a spreadsheet, and logging your own use, you’d just type…
canhazgpu run --gpus 2 -- vllm serve my/model
canhazgpu figures out which GPUs are available on the machine, and then sets CUDA_VISIBLE_DEVICES, sets their reservation in a local data store, and then runs the command after the --, sorta like sudo echo make me a sandwich runs everything after sudo.
It’s really pretty too – it has a beautiful canhazgpu status to show who’s using what, uses a heartbeat pattern to figure out when reservations are done, and can even detect usage outside of canhazgpu if someone didn’t use it.
The team has adopted it pretty quickly, and Russell’s been sending off super vibe code agents to bug fix it, and of course with him puppeteering it like some kind of puppet virtuoso.
I’ve been working in a new role lately as an MLOps engineer, managing some of this infra (including Beaker), which includes provisioning machines for two classes of use: automation systems, and developer workstations. So I’ve been slanging Ansible playbooks.
Well… I got to thinking, what if we could scale this usage for canhazgpu to multiple machines?
On top of that, I’ve been collaborating with the upstream SIG-CI for vLLM. We manage the CI infra for vLLM and are working to improve how our CI works. I’ve been proposing some changes to how Docker images are built and run in order to get more efficiency, and a better developer experience for getting CI signal quickly.
I decided I wanted to combine both of these things:
The ease of use of canhazgpu
Speedy developer test runs of vLLM.
Kubernetes.
A DRA driver based on canhazgpu seemed to be just the right direction.
So I fired up Claude Code (and GitHub Copilot), drank a LOT of coffee, and brewed up this demo…
Why k8shazgpu?
While canhazgpu keeps sharing simple on one box, k8shazgpu scales that same philosophy across an entire cluster.
My goal was to make a Street Fighter-style combo of the ease of canhazgpu with the power of Kubernetes and dynamic resource allocation. The result is a DRA driver and controller that lets you request a GPU on any machine in your k8s cluster. Under the hood, k8shazgpu defines custom ResourceClaims, maintains a cache of images and git repos, and runs node agents that manage GPU allocation. When you ask for a GPU, the controller provisions a vLLM pod with the right image and mounts the cache, so your model starts serving almost immediately.
Key features include:
Declarative resource claims: you ask for a GPU and k8shazgpu uses Kubernetes’ Dynamic Resource Allocation to fulfill the request.
Intelligent caching: cache plans describe what images, repos and models should be pre‑fetched; node agents keep those caches warm.
vLLM integration: k8shazgpu automatically detects when you’re running from a vLLM checkout, figures out the merge‑base, packages your local changes and applies them in the container before launching.
Simple CLI: basic commands like status, vllm run, and cleanup make the tooling approachable.
To make a test run, you’d simply make a call like:
k8shazgpu vllm run -- vllm serve facebook/opt-125m --gpu-memory-utilization 0.8
Behind the scenes, all the magic is happening: the resource claims and the caching mechanisms (which were inspired by my work on improving CI).
Just like canhazgpu, developers using k8shazgpu don’t have to know anything about Kubernetes or DRA. You run the CLI and the controller does the dirty work. DRA itself is a fantastic building block, but it’s not a great end‑user interface. With k8shazgpu we hide that complexity behind a friendly CLI.
Let’s see it in action!
Here’s a demo, quick-like – it shows how you operate from a vLLM clone on a client workstation and make a request using k8shazgpu.
It’s a bit raw (the tmux panes render funny in asciinema), sorry, but it shows the basic workflow of requesting a GPU, running vLLM, and queueing up requests when you’re out of GPUs.
Improvements in flight!
Kubernetes 1.34 shipped in August 2025 with DRA APIs finally graduated to GA. My buddy Miguel Duarte has a pull request that bumps our code to the new v1 API. That means the controller is ready for clusters running the latest release. (For users, nothing would change – you still don’t need to touch any YAML, or even understand the DRA primitives! phew!)
Miguel and I worked on a talk about DRA for networking in Kubernetes at FOSDEM last year, and I’m looking forward to batting around some ideas about how this further pushes the conversation about high performance computing environments – both for AI/ML use cases as well as high performance networking use cases, which I’m more and more convinced are intertwined.
If you’re curious about the code, check out the branch k8shazgpu-demo-worthy. Miguel’s PR is a small but important patch that removes v1beta1 and updates our manifests for DRA’s GA.
Waxing philosophical on next steps.
Russell’s canhazgpu is totally “vibe coded”. So, I built off of that. It’s awesome how far Russell got with it, and he’s clearly great at using the tool and pushing great results out of it. I’ve been kind of “vibe coding” for a few years at this point. Even as an experiment, I built a CNI plugin called Surveyor for a FOSDEM talk by writing all the boilerplate with an LLM and then copy‑pasting giant hunks of code. These days the agentic technologies are getting better – Claude, GitHub Copilot, and even OpenAI’s Codex helped me prototype k8shazgpu. It still takes a human to glue everything together, but the tooling makes rapid prototyping fun.
I’m looking forward to seeing if I can garner some adoption of this tool among my team, and to seeing whether it, or concepts from it, can be integrated into upstream vLLM CI.
But, I have a feeling there’s a few other directions to move in…
I’d love to share further with people how they can learn DRA and learn to utilize it as it is now.
Right now k8shazgpu allocates X-number of GPUs on a single host – your workload is limited to a single machine. What if your workload is multi-host, like llm-d?
I’m rather curious about how that impacts networking across boxes.
Until next time, enjoy your cappuccino and happy vibe hacking!
I’ve been pretty excited to say hello to vllm, which is a library for LLM serving and inference. That, along with some recent upgrades in my lab setup, made me figure it was time to do it in the way that best fits my own brand, with these parameters:
Running on Fedora
Containerized (using Podman)
Using my new 50-series card in my new desktop
And… Distributed inference across my desktop and my GPU lab server.
It was fun to do, and satisfying! I learned a lot and I’m hopeful to be able to contribute something back to the community based on what I learned. Quick note that… This is totally what I’d call a “toy setup”. If I learned anything, it’s that vLLM has a design that’s meant for industrial-type applications, like, in the DCs. The kind of place where I used to drink burnt coffee at 2am in my telco switch tech days. But for this one – we can drink legit espresso at home and kick it laid back in some Thai fisherman pants. If you’re following along, I recommend a flat white for this one, I’ve been making mine using cream-top milk from Sweet Rowen Farmstead here in Vermont. It’s got a high milk fat content, so it basically doesn’t foam at all.
I’m inspired in part by this YT video about running an LLM across multiple computers. With the work I’ve been doing in exploring Dynamic Resource Allocation (DRA) in Kubernetes, this really speaks to me. And! I’ve got a good lab setup for it, too. I like this demo because it used prosumer hardware and it was well put together. However! It doesn’t use a containerized setup. I’m really interested in the distributed aspect, because I’m also curious how this might impact K8s networking in the future, too. Last but not least, I recently made an upgrade in my lab, so I’ve got 30, 40, and 50-series Nvidia cards. I had a hard time deciding if I should buy a 50-series, but I found a good deal on a 5060ti and was pretty excited it had a good chunk of VRAM – I kept wondering if it’d be worth it. Well, I have to say, with what I was able to learn using it: already worth it.
The tl;dr
Goal: Run vLLM in my home lab with distributed inference across two machines using a 50-series NVIDIA GPU (sm_120 arch).
The official image doesn’t work with 50-series GPUs yet due to an unsupported CUDA arch, so I built a custom Docker image using CUDA 12.9, compiled with sm_120 support.
I used Podman on Fedora, with nvidia-container-toolkit, and CDI setup for GPU access.
Distributed inference handled via Ray; using --net=host for simplicity in a lab environment.
Bonus: Played with structured outputs – super awesome.
Next steps: Might see what I can contribute back to vLLM, especially on the tooling and CI front.
…That being said, grab your flat white and let’s hit the terminal…
Getting my bearings
My initial instinct was to build an image from the ./docker/Dockerfile.nightly_torch in a clone of vllm, because of all the Dockerfiles it was the one I liked the read of the most – especially because I was worried that, with a 50-series, I might have to use something poppin’ fresh.
But, in the meanwhile I tried to use their official Docker image on my desktop workstation.
However, I had to setup CDI, first.
I ran into this on a docker run…
Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# This was unnecessary, and didn't work with dnf5
# sudo dnf config-manager --enable nvidia-container-toolkit-experimental
sudo dnf install -y nvidia-container-toolkit
# Don't forget to generate the CDI manifest!
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Unsure if I needed to, but, out of instinct I did...
sudo systemctl restart podman
If you’re not familiar with CDI, it’s the “container device interface”, and the CNCF container-device-interface repo is probably the place to go to learn more – it’s what allows us to allocate the GPU for our container to use.
Using the official vLLM Docker image
I picked up the docs for using vllm’s official docker image, which – in most cases – will probably work with most hardware, but as you’ll see, we needed a little bit more for the 50-series.
podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
Also, their example uses Mistral 7B v0.1, which I needed access for – so make sure your Hugging Face token is there; apparently you have to agree to share some info with the authors to download it. I did that on Hugging Face, got it to run, but was met with…
NVIDIA GeForce RTX 5060 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
Getting an sm_120 compatible build
By sm_120, I’m referring to the compute capability of NVIDIA Blackwell architecture (that is, the RTX 50-series GPUs).
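If you want a quick sanity check of what your card reports, and which architectures your PyTorch build was actually compiled for, something like this works – just a sketch to run inside whatever container or venv you’re testing:

# Quick sanity check: what compute capability does the GPU report, and which
# architectures was this PyTorch build compiled for?
import torch

major, minor = torch.cuda.get_device_capability(0)  # e.g. (12, 0) for sm_120 / Blackwell
print(f"GPU 0 compute capability: sm_{major}{minor}")
print("Compiled arch list:", torch.cuda.get_arch_list())  # should include 'sm_120' for a 50-series build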
Now it’s time to roll up our sleeves and make a full build. In the end, I used this modified Dockerfile – more on my process as you read through. I did bump CUDA to 12.9 from 12.8, and I also had to make sure I specified the new sm_120 arch.
I got a hint from my bud Russell about VLLM_USE_PRECOMPILED=1, which uses vLLM’s prebuilt wheels. I get a build after about… 30-45ish minutes? Nice way to toast up my CPU and let my AIO cooler purr. During the build I saw a lot of build instructions with sm_* stuff but not sm_120, which had me concerned, and I didn’t get a copy/paste because it exceeded my terminal buffer, oh well! I wound up making a number of attempts, and was getting errors when I tried to podman run it, like:
# vllm serve --model mistralai/Mistral-7B-v0.1
[...snip...]
ModuleNotFoundError: No module named 'vllm._C'
During the build – I had a couple of unlucky crashes (which I didn’t dig deeply into to diagnose, new F42 and all that), but! I got a build. Of course I forgot to time it; it happened overnight. I also need to figure out the caching, because I don’t think I’m caching, and the Dockerfile sure looks like it has a lot of hooks for caching – it must be SUPER important with these huge builds.
I posted my originally working Dockerfile as a GitHub gist and a diff of it (used against commit 6115b115826040ad1f49b69a8b4fdd59f0df5113, but! Be warned, I think at some point in the Dockerfile build process it pulls a fresh head from a git remote).
Some weird hiccough prevented me from logging into quay, so I pushed… (Update! I have it on quay now!)
Getting vLLM running on a single host and giving it a test.
And I run it like this:
podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
--env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--ipc=host \
localhost/dougbtv/vllm:latest \
mistralai/Mistral-7B-v0.1 \
--gpu-memory-utilization 0.5
I’m getting a lot further, but I see:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 15.47 GiB of which 68.00 MiB is free. Including non-PyTorch memory, this process has 12.51 GiB memory in use.
I tried to look around quite a bit… But I realized something VERY important. Here’s the thing:
vLLM isn’t ollama. vLLM is meant for high-throughput inference serving in data centers, not the scrappy-style local inference that I usually do. So, something like ollama uses some quantized GGUF models and CPU offloading to work in constrained environments, but, this is designed differently.
I previously assumed I could run a model that size on this 5060ti with 16 gigs of VRAM, but, not this way. So I decided to run a tiny model instead: facebook/opt-125m.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"prompt": "What is the capital of Vermont?",
"max_tokens": 20
}'

{"id":"cmpl-b062e3b04f48487590b36e21cfba0839","object":"text_completion","created":1746708569,"model":"facebook/opt-125m","choices":[{"index":0,"text":" Edit: Or, I heard Vermont is a home state\nI don't think you put the","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20,"prompt_tokens_details":null}}
Maybe just “Montpelier” would be an appropriate answer.
But, like The Bad News Bears, sometimes I just gotta try it, put the heart into it. And that part worked! I got the 50-series working with vLLM which was half the goal.
I tried loading a quantized model, took me a few tries to get the params right…
podman run --gpus all \
-v /home/doug/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
--env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--ipc=host \
localhost/dougbtv/vllm:latest \
TheBloke/Mistral-7B-Instruct-v0.1-AWQ \
--quantization awq \
--dtype float16 \
--gpu-memory-utilization 0.95
Which also appears to work…
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
"prompt": "What is the capital of Vermont?",
"max_tokens": 20
}'

{"id":"cmpl-4c7d34a0a8a547d4b4fffbaa2b64fdfa","object":"text_completion","created":1746721156,"model":"TheBloke/Mistral-7B-Instruct-v0.1-AWQ","choices":[{"index":0,"text":"\nMontpelier","logprobs":null,"finish_reason":"stop","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":9,"total_tokens":14,"completion_tokens":5,"prompt_tokens_details":null}}
Great. Nailed it this time.
And give it a run on the 3090 machine…
Now on the 3090 machine, which I call stimsonmt. It’s named for a nearby mountain to my house, it sits next to another box bonemt which is on the other side of the valley, so, that way I know which is which positionally, as they’re pets! Fun fact, Whereabouts IPAM CNI’s topographic logo is based on the topo for Stimson Mt!
Going to try with the example mistral model from the docs, again…
podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
--env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--security-opt=label=disable \
--device=nvidia.com/gpu=all \
--ipc=host \
docker.io/dougbtv/vllm \
mistralai/Mistral-7B-v0.1 \
--gpu-memory-utilization 0.95
I had to add --device=nvidia.com/gpu=all, certainly because of how I had set up podman on this machine (I went and scooped the podman run that I’m using for kohya_ss, so I had a good reference!). I’ll try to normalize them later.
Appears to run… let’s try it…
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-v0.1",
"prompt": "What is the capital of Vermont?",
"max_tokens": 20
}'

{"id":"cmpl-fa383b641f79470495d3e9a0fa537c25","object":"text_completion","created":1746725937,"model":"mistralai/Mistral-7B-v0.1","choices":[{"index":0,"text":"\n\nThe capital of Vermont is Montpelier, which is also the state's smallest","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":9,"total_tokens":29,"completion_tokens":20,"prompt_tokens_details":null}}
Took 4 tries before it got the right answer, but it did work! That used basically the whole GPU VRAM, too.
Structured output is a way to kind of mold the LLM to give you output in a specific structure – like, say you want the return to be all JSON, or limited to a specific set of values; heck, you can even use a regex!
I brewed up this python script (as the curl was getting out of hand) to determine whether the sentiment is “rad” or “bogus” depending on what you feed in, so I also tried that…
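Here’s roughly what the script boils down to – a minimal sketch, not my exact test.py, assuming vLLM’s guided_choice parameter on the completions endpoint and whichever model you’ve got loaded:

# test.py -- a minimal sketch of the rad/bogus classifier.
# Uses vLLM's guided_choice parameter to constrain the completion to one of two values.
import sys
import requests

statement = sys.argv[1]

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # whichever model you're serving
        "prompt": f"Classify the sentiment of this statement as rad or bogus: {statement}\nSentiment:",
        "max_tokens": 5,
        "temperature": 0,
        # vLLM's guided decoding: only allow one of these outputs
        "guided_choice": ["rad", "bogus"],
    },
    timeout=60,
)
resp.raise_for_status()

print(resp.json()["choices"][0]["text"].strip())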
$ python test.py "the powder is wicked in the trees dude"
rad
$ python test.py "parcheesy is the best game"
bogus
Getting it running across nodes.
Now that I had a baseline, I went to see if I can run it on two nodes…
After refreshing myself on a few moments in the YT video, I realized that there’s one thing I’m probably going to have to account for, for now: you have to specify an interface in some of the parameters.
This makes me feel a few ways:
It’s so network infra kinda stuff: Specifying the interface.
This has been a decade long battle in container land.
…I know I’ve got this one on lock, one way or another *puts shades on*
We’re going old school, where it all began for me with containers and networking…
--net=host
For this scenario, it’s probably the most efficient, and security isn’t my biggest concern. As per usual, I have a long story about how I used to use this, and now I don’t, and I feel like I was instrumental in making it so that people don’t have to do that anymore, but for today? It’s just the thing. Maybe we can return to this to handle it another way, but it’s also kind of its own project.
Ray is a unified framework for scaling AI and Python applications
Sounds just like what we want to do, awesome.
Ok that’s a good start… ray is already built in our image, phew!
root@478a2ec83613:/vllm-workspace# ray --version
2025-05-09 05:04:41,498 - INFO - Note: NumExpr detected 24 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-05-09 05:04:41,498 - INFO - NumExpr defaulting to 16 threads.
ray, version 2.46.0
Unfortunately, iproute2 isn’t installed, so we have to install it with apt install iproute2. I wasn’t quite expecting this, but I can see the host interface inside the container.
root@e60d5253152e:/vllm-workspace# ip a | grep -P "^\d|192"
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
inet 192.168.50.198/24 brd 192.168.50.255 scope global noprefixroute eno1
And I can inspect it with:
$ podman inspect --format '' condescending_mendel
pasta
Apparently! I’m not familiar with the ins-and-outs of pasta networking from podman. These docs provide a pretty good overview, it’s named that way for “Point-to-point Adaptive Shared Transport Architecture”. I’m so used to CNI. Sooooo! I’m not sure of the risks, so, we’re sticking with host networking.
Here’s a run with host networking that I’m using for checking the situation out…
podman run -it --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
--env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--net=host \
--ipc=host \
--entrypoint=/bin/bash \
localhost/dougbtv/vllm:latest
I’m going to have my stimsonmt host act as the “head”, the main node, the only reason I chose it is because it has more VRAM, having a 3090. My other host (named: yoda, long story on that one) will be a worker.
So I run the start command, and I can see it has the host IP address…
root@stimsonmt:/vllm-workspace# ray start --head --port=6379 --memory 24000000000
[...snip...]
Local node IP: 192.168.50.199
Let’s try to join a node… I’m just giving it 12 out of 16 gigs because yoda is a workstation, so my GUI is hogging VRAM.
ray start --memory 12000000000 --address='192.168.50.199:6379'
Cause for a w00t, I can see the combination with ray status on the head (or the node, too)
root@stimsonmt:/vllm-workspace# ray status
2025-05-09 05:39:18,663 - INFO - NumExpr defaulting to 16 threads.
======== Autoscaler status: 2025-05-09 05:39:16.967610 ========
Node status
---------------------------------------------------------------
Active:
1 node_a9d14d6a94d5e77fbdafdb8802dada067a37417f2f7ad98426aa3eb7
1 node_4122ac6d486c994414de5ed9704c329002b5ec596fb38dd0806fe1fd
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/40.0 CPU
0.0/2.0 GPU
0B/33.53GiB memory
0B/46.26GiB object_store_memory
Total Constraints:
(no request_resources() constraints)
Total Demands:
(no resource demands)
I had to fiddle with it, because that’s a 2 node / 4 GPU setup, and it took me a few tries, but, I wound up getting this warning which I really like in my own way:
WARNING 05-09 06:03:18 [ray_utils.py:199] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node d97d078c6e6c301dafb330684dd1aede943dcbe95dabc32bfc3fa5c7. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
It mentions Infiniband – which is the kind of thing I see get a lot of use in HPC networking scenarios, awesome. I don’t have that setup, that’s yet another blog article! But, after trying a number of combos, and failing, I think that warning is actually good news in our case.
For my test, I wound up choosing the same model Bijan (the YT creator) used – he was aiming for something small and simple, and that works for me…
Now, we just run the serve command on the head, and let it fire up….
We can see that we get a process loaded on the node, yoda.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3.5-mini-instruct",
"prompt": "Name two former or current US senators from Vermont.",
"max_tokens": 150,
"temperature": 0.7
}'
This calls for a: wewwwwwwwwwwwwwwt! There’s a successful completion!
doug@stimsonmt:/tmp/testgrammar$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3.5-mini-instruct",
"prompt": "Name two former or current US senators from Vermont.",
"max_tokens": 150,
"temperature": 0.7
}'
{"id":"cmpl-8606a78a9d6b4a0ba7d8290c0d8e6893","object":"text_completion","created":1746797459,"model":"microsoft/Phi-3.5-mini-instruct","choices":[{"index":0,"text":"\n\nVermont, a state known for its picturesque landscapes and progressive political stance, has had several distinguished individuals serve in the United States Senate. Here are two former and two current US senators from Vermont:\n\nFormer Senators:\n\n1. Patrick Leahy: Serving since 1975, Patrick Leahy is a Democrat who has been one of the longest-serving senators in the history of the United States. He was re-elected multiple times and has had a significant impact on national policy and legislation.\n\n2. Jim Jeffords: He served from 1989 to 2007 as a Republican senator","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":12,"total_tokens":162,"completion_tokens":150,"prompt_tokens_details":null}}
Why it somehow mentioned Jeffords before Bernie is…. An absolute mystery to me, that seems so unlikely! But… Alas, it’s also correct. And… Hurray!
And guess what? It also can use the structured output!
doug@stimsonmt:/tmp/testgrammar$ python test.py "shredding a quick uphill lap at Bolton after work"
rad
I’m also trying to figure out how to contribute some of this learning back to the codebase! Thankfully Russell also pointed me at the vllm ci infra project. And I even joined the vLLM slack!
I had been putting off upgrading Fedora because… Wayland is becoming unavoidable. And I’ve been avoiding it for a couple releases now. And Fedora 42 is out. And since it’s the answer to life, the universe and everything, I can’t deny it. Wayland is the future.
Primarily because I’ve been an i3wm user – it’s a “twm” (tiling window manager, or that’s what I call it) – and I’ve gotten mighty cozy in it, but eventually, you have to move on! So, alas, I’m going back to GNOME, and getting my life set up with Wayland.
My main desktop workstation is, and has been, Fedora for… basically as long as I’ve been a professional.
Yep! Fedora Core 3! Back in my day! Err, before Fedora 7, we called it Fedora Core.
People can remember where they were during major historical events, and I can remember where I was when someone told me that I probably wouldn’t be using Red Hat Linux anymore, but instead… Fedora. I sandbagged for a release, I think, and eventually got on board at Fedora Core 2. Unfortunately I only have photographic proof of FC3! Apparently I do that and sandbag a release or two, still doing it!
Here’s the thing – I have a bunch of great machines in my lab: a media workstation (for graphics, both traditional and AI/ML, as well as music production), a GPU lab machine (for training runs and long-running jobs), a virtualization host, and… also a primary developer’s workstation. That last one is kind of the vehicle to use all the better machines, so it hasn’t gotten love in a while. But I knew I had to upgrade, and… that machine was getting a little draggy, so, time to upgrade.
Getting my life back together as an i3wm user.
My good bud Leif mentioned this article from sudoscience about turning gnome into a tiling window manager. He got me into i3wm, so, thanks for throwing me a line.
I settled on trying a few things that I wound up keeping and liking, so far…
I use Barrier so that I can use my work laptop, my media workstation, and my developer’s workstation all at the same time. But… Guess what? Apparently the maintainers have started their own fork, input-leap. That’s probably good news, but… I did struggle a bit. Wayland has some nice security considerations, so, you can’t start it without manually accepting the remote connection, which is… Sort of a bummer, so I have to KVM switch (which is really just a USB switch for my inputs) in order to wake the machine and accept the connection. The media workstation needs to be primary – especially because it uses a graphical pen tablet.
I also had to hack together some stuff to SSH my clipboard (yuuuuck) to be able to use it conveniently between boxen, because, that’s also a security issue. So, remote control, not as good, but… I’m going to live with it and work around it for now.
Pop Shell is treating me fairly well, at least I’m able to do most stuff that I could do before, just some clunking around, still getting all my chords straightened out.
Getting my Nvidia GPU to fire up.
I have other, better GPUs in my lab, but I needed something in this machine. I wound up opting for a 5060ti because I could get a good deal, and these things have a bunch of VRAM these days, so I was kind of pumped to have that much VRAM around – at least it makes some things possible, even if it’s got way fewer CUDA cores than my other GPUs in the lab.
However, I’m kind of in the early adopter phase, both a 50-series and an O/S that was just released 2 weeks ago.
Thank heavens for if-not-true-then-false (inttf)! I have used these articles for Fedora setups for a while now, that author deserves an award.
These saved my bacon. I tried installing from fusion repos, and I wasn’t having good luck. I did figure out that my card does require a beta driver, so… That might be related.
Mostly, after I got going with the inttf articles, I was making progress. I did have to reboot umpteen times while I really screwed it up, and it also tested my grub2 know-how (I wasn’t getting a clean boot when I first plugged in the video card, naturally). Unfortunately, I’m still getting a slow boot, and I lost graphical boot, despite inttf’s awesome Plymouth setup. I think I can remember losing graphical boot… many… many times in my Fedora career. I tried a few times, and I’ll try again as the drivers get updates and whatnot, but for now, I’m living with it.