An Experimental LTX-2 Integration for vLLM-Omni -- Expect Weird Videos
13 Jan 2026

Well, doggies! When the LTX-2 video model landed last week, I went ahead and brewed up an experimental implementation of it in vllm-omni, the omni-modality serving layer for vLLM.
LTX-2 was released on January 6th, and I wanted to see how far I could get wiring it up end-to-end. From the HuggingFace repo (emphasis mine)…
LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
I was going to try it out in ComfyUI, but I decided “hey maybe I can brew up an implementation based on the Lightricks/LTX-2 GitHub repo”, so I got to it.
Today I’ve got a Docker image published with my work, up on quay.io/dosmith/vllm-omni:ltx2-experimental, instructions to fire it up, and then some notes on the implementation.
Since I’ve been experimenting with vllm-omni recently, I wanted to see if I could brew up an implementation – I’ve at least got it producing video (no audio! I wanted to PoC the video path before tackling audio):
Actually, that was basically the first thing I got to come out of it (well, at least the first working thing that I intentionally prompted for!).
This one came out a little better, but it’s not quite as fancy schmancy as the videos on the ComfyUI support announcement.
Also, I tried to do the “infamous Will Smith eating spaghetti” (wikipedia page) (holy smokes, this has a wiki article!? I thought it was a /r/StableDiffusion meme!). Turns out… I suspect the model must’ve been scrubbed for copyright-infringement-type problems, and… it didn’t look like Will Smith at all (so I omitted posting it!).
These 1024x1024 10-second videos took me about 3 minutes end-to-end to produce with vLLM-Omni on a single A100 GPU.
If you stay until the end, I’ll even show you one I generated that was inspired by the Tim & Eric Cinco I-Jammer.
Alright… Well, how do I run it!?
A few quick notes about how and where I ran it, and a few of the limitations before I get into it…
I ran it:
- On an NVIDIA A100 (80G of VRAM)
- Using Podman
- On RHEL 9.6
But other than that, it should be straightforward…
# Set your HuggingFace token
export HF_TOKEN=hf_your_token_here
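# Run the offline LTX-2 text-to-video example from the experimental image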
podman run --rm \
--device nvidia.com/gpu=0 \
--security-opt=label=disable \
--userns=keep-id \
-e NVIDIA_VISIBLE_DEVICES=0 \
-e CUDA_VISIBLE_DEVICES=0 \
-e HF_TOKEN="${HF_TOKEN}" \
-e HUGGINGFACE_HUB_CACHE=/hf/hub \
-v ~/.cache/huggingface:/hf/hub \
-v $(pwd)/outputs:/output \
quay.io/dosmith/vllm-omni:ltx2-experimental \
python examples/offline_inference/text_to_video_ltx2.py \
--model Lightricks/LTX-2 \
--prompt "A corny used-car-lot commercial set in a dusty western town" \
--height 1024 --width 1024 --num_frames 121 \
--fps 24.0 \
--output /output/commercial.mp4
Note: You could probably use plain old docker too, but it might require munging that command a little bit (GPU flags, the SELinux bits, etc.)
Where does it live?
I’ve got the container image posted on quay.io for you:
- quay.io/dosmith/vllm-omni:ltx2-experimental
And I’ve got the code in my branch of vllm-omni, on GitHub:
- dougbtv/vllm-omni, in the ltx2 branch.
I’ve also got a docker/Dockerfile.ltx2 in there showing how I built it.
How I wound up getting there
This work is primarily meant to be experimental. I wanted to see if I could “make it work” so that I could try out both LTX-2, and adding a new model implementation in vLLM-Omni. I’m not sure where it stands in terms of permanence.
I used both the LTX-2 GitHub repo and the ComfyUI implementation as references. From there, I used ChatGPT (with GPT-5.2) to orchestrate prompts for Claude, plus some back and forth with Codex to check the work. The implementation leans primarily on the LTX-2 reference stack via the ltx_core + ltx_pipelines libraries, which provide the model components (transformer, VAE, schedulers, helpers) used by the vLLM-Omni LTX-2 pipeline in my branch.
In summary from Codex…
- Introduced an LTX‑2 diffusion pipeline with model loading, distilled 8‑step denoising, video‑only (no audio), and post‑processing.
- Added LTX-2 configuration helpers to enforce dimension constraints (32-pixel multiples + 8k+1 frames) and provide defaults for fps, inference steps, and frame count (a sketch of that check follows this list).
- Registered LTX-2 in the diffusion registry and wired up a dedicated post-process function for decoding video outputs (a rough sketch of that pattern follows below).
- Shipped a detailed offline example that shows how to run LTX-2, extract pipeline outputs, and export an MP4 using PyAV (a sketch of the PyAV export follows below).
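To make that dimension constraint concrete, here’s a minimal sketch of the kind of check the configuration helpers enforce. The function name and error messages are mine (hypothetical), not lifted from the branch; the constraints themselves (height/width as multiples of 32, frame counts of the form 8k+1) are the ones from the summary above.

# Hypothetical sketch of the LTX-2 dimension constraints; not the actual
# helper from my branch, just the rules described in the summary above.

def validate_ltx2_dims(height: int, width: int, num_frames: int) -> None:
    # Height and width need to be multiples of 32 pixels.
    if height % 32 != 0 or width % 32 != 0:
        raise ValueError(f"height/width must be multiples of 32, got {height}x{width}")
    # The frame count needs to be of the form 8k + 1 (e.g. 121 = 8 * 15 + 1).
    if (num_frames - 1) % 8 != 0:
        raise ValueError(f"num_frames must be 8k+1, got {num_frames}")

# The example command above passes this check: 1024x1024 at 121 frames.
validate_ltx2_dims(1024, 1024, 121)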
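And here’s a rough idea of what “registered in the diffusion registry” means in spirit. This is not vLLM-Omni’s real registry API (every name below is hypothetical); it’s just the general pattern of mapping a model name to a pipeline class plus a dedicated post-process function.

# Hypothetical illustration of a diffusion registry; NOT vLLM-Omni's
# actual API, just the general shape of the pattern.

DIFFUSION_REGISTRY: dict[str, tuple] = {}

def register_diffusion_model(name, pipeline_cls, post_process_fn):
    """Map a model name to its pipeline class and post-processing function."""
    DIFFUSION_REGISTRY[name] = (pipeline_cls, post_process_fn)

class LTX2Pipeline:
    """Placeholder stand-in for the real LTX-2 pipeline class."""

def ltx2_post_process(decoded_video):
    # Placeholder: in the real code this is where decoded video outputs
    # get turned into frames for the caller.
    return decoded_video

register_diffusion_model("Lightricks/LTX-2", LTX2Pipeline, ltx2_post_process)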
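Finally, the MP4 export in the example script goes through PyAV. Here’s a minimal sketch of that approach, assuming the pipeline hands you a list of HxWx3 uint8 RGB numpy arrays; the function name and frame format are my assumptions, not necessarily what the script does verbatim.

# Minimal PyAV MP4 export sketch; assumes `frames` is a list of
# H x W x 3 uint8 RGB numpy arrays.
import av
import numpy as np

def write_mp4(frames: list[np.ndarray], path: str, fps: float = 24.0) -> None:
    container = av.open(path, mode="w")
    stream = container.add_stream("libx264", rate=int(fps))
    stream.width = frames[0].shape[1]
    stream.height = frames[0].shape[0]
    stream.pix_fmt = "yuv420p"

    for frame in frames:
        video_frame = av.VideoFrame.from_ndarray(frame, format="rgb24")
        # encode() buffers internally and may return zero or more packets.
        for packet in stream.encode(video_frame):
            container.mux(packet)

    # Flush anything still buffered in the encoder, then close the file.
    for packet in stream.encode():
        container.mux(packet)
    container.close()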
Debugging: the VAE problem
The initial implementation seemed feasible, but… I ran into an issue with what I recognized was probably the VAE decoding. How’d I know that? Well, if you were fiddling around with Stable Diffusion 1.X when it was first coming out, you’d see this kind of thing often when you misconfigured something. Also, you’d see people post similar problems on /r/StableDiffusion in those days.
For example, you can see some of the busted video here on imgur.

Luckily, my gut was right: it turns out I had the VAE decoding in the wrong spot, order-of-operations-wise.
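To illustrate the fix in a heavily simplified, hypothetical form (this is not the actual pipeline code, and the function and argument names are mine): the latents should go through the full distilled denoising loop first, and only the final latents get handed to the VAE decoder.

# Heavily simplified, illustrative ordering; not the real pipeline code.
# The point: the VAE decode happens once, after the whole distilled
# denoising loop, rather than earlier in the flow (which was my bug).

def generate_video(transformer, scheduler, vae, latents, prompt_embeds,
                   num_steps: int = 8):
    # Run all of the distilled denoising steps purely in latent space.
    for t in scheduler.timesteps[:num_steps]:
        model_out = transformer(latents, t, prompt_embeds)
        latents = scheduler.step(model_out, t, latents)

    # Only now, with fully denoised latents, decode to pixel-space frames.
    frames = vae.decode(latents)
    return frames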
This brought me to the point where I could make the kind of unhinged things that I like to make, such as the…
Bonus: Cowboy O-Hungee
You stayed for the entire blog article… and your reward is… this!? Sorry.