An Experimental LTX-2 Integration for vLLM-Omni -- Expect Weird Videos

Well, doggies! When the LTX-2 video model landed last week, I went ahead and brewed up an experimental implementation of it in vllm-omni, the omni-modality serving layer for vLLM.

LTX-2 was released on January 6th, and I wanted to see how far I could get wiring it up end-to-end. From the HuggingFace repo (emphasis mine)…

LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

I was going to try it out in ComfyUI, but I decided “hey maybe I can brew up an implementation based on the Lightricks/LTX-2 GitHub repo”, so I got to it.

Today for you: I’ve got a Docker image published with my work, up on quay.io/dosmith/vllm-omni:ltx2-experimental, along with instructions to fire it up, followed by some notes on the implementation.

Since I’ve been experimenting with vllm-omni recently, I wanted to see if I could get a working implementation going. I’ve at least got it producing video (no audio yet! I wanted to PoC the video path before I tackled audio):

Demo Short: Watch on YouTube

That was basically the first thing I got to come out of it (well, at least the first working thing that I intentionally prompted for!)

This one came out a little better, but it’s not quite as fancy schmancy as the videos on the ComfyUI support announcement.

Demo Short: Watch on YouTube

Also, I tried to do the “infamous Will Smith eating spaghetti” (wikipedia page) (holy smokes this has a wiki article!? I thought it was a /r/StableDiffusion meme!). Turns out… I suspect the model must’ve been scrubbed for copyright-infringement-type problems, because it didn’t look like Will Smith at all (so I omitted posting it!).

These 1024x1024, 10-second videos took me about 3 minutes end-to-end to produce with vLLM-Omni, on a single A100 GPU.

If you stay until the end, I’ll even show you one I generated that was inspired by the Tim & Eric Cinco I-Jammer.

Alright… Well, how do I run it!?

A few quick notes about how and where I ran it, and a few of the limitations before I get into it…

I ran it:

- On a single A100 GPU
- With podman, on an SELinux-enabled host (hence the --security-opt and --userns flags below)
- For video only; the audio path isn’t wired up yet

But other than that, it should be straightforward…

# Set your HuggingFace token
export HF_TOKEN=hf_your_token_here

# Make sure the output directory exists on the host
mkdir -p outputs

# The CDI GPU device plus the security-opt/userns flags are for rootless podman
# on an SELinux host; the volume mounts reuse the host HuggingFace cache and
# drop the rendered video into ./outputs
podman run --rm \
  --device nvidia.com/gpu=0 \
  --security-opt=label=disable \
  --userns=keep-id \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e HF_TOKEN="${HF_TOKEN}" \
  -e HUGGINGFACE_HUB_CACHE=/hf/hub \
  -v ~/.cache/huggingface:/hf/hub \
  -v "$(pwd)/outputs:/output" \
  quay.io/dosmith/vllm-omni:ltx2-experimental \
  python examples/offline_inference/text_to_video_ltx2.py \
    --model Lightricks/LTX-2 \
    --prompt "A corny used-car-lot commercial set in a dusty western town" \
    --height 1024 --width 1024 --num_frames 121 \
    --fps 24.0 \
    --output /output/commercial.mp4

Note: You could probably use plain old docker too, but it might require munging that command a little bit (e.g., --gpus all instead of the CDI --device flag, and dropping the SELinux/userns bits if your host doesn’t need them).

Where does it live?

I’ve got the container image posted on quay.io for you: quay.io/dosmith/vllm-omni:ltx2-experimental

And I’ve got the code in my branch of vllm-omni, on GitHub:

There’s also a docker/Dockerfile.ltx2 in that branch showing how I built the image.

How I wound up getting there

This work is primarily meant to be experimental. I wanted to see if I could “make it work” so that I could try out both LTX-2 itself and the process of adding a new model implementation to vLLM-Omni. I’m not sure how permanent it will end up being.

I used both the LTX-2 GitHub repo and the ComfyUI implementation as references. Then I used ChatGPT (with GPT-5.2) to orchestrate prompts for Claude, with some back and forth with Codex to check the work. With those tools I leaned primarily on the LTX-2 code itself: the reference stack’s ltx_core + ltx_pipelines libraries provide the model components (transformer, VAE, schedulers, helpers) used by the vLLM-Omni LTX-2 pipeline in my branch.
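
To give a rough feel for the shape of that pipeline, here’s a minimal sketch of the text-to-video path: encode the prompt, run the DiT denoising loop under the scheduler, then hand the finished latents to the VAE to get pixels back out. The names below are illustrative stand-ins, not the actual vllm-omni or ltx_core/ltx_pipelines interfaces.

# Illustrative sketch only: hypothetical names, not the real
# vllm-omni / ltx_core / ltx_pipelines APIs.
import torch

def generate_video(pipe, prompt, height, width, num_frames, num_steps=30):
    # 1. Encode the text prompt into conditioning embeddings.
    cond = pipe.text_encoder(prompt)

    # 2. Start from random video latents sized for the requested output.
    latents = torch.randn(
        pipe.latent_shape(height, width, num_frames),
        device=pipe.device, dtype=pipe.dtype,
    )

    # 3. DiT denoising loop, driven by the scheduler's timesteps.
    for t in pipe.scheduler.timesteps(num_steps):
        noise_pred = pipe.transformer(latents, t, cond)
        latents = pipe.scheduler.step(noise_pred, t, latents)

    # 4. Only after the loop finishes does the VAE decode latents into frames.
    return pipe.vae.decode(latents)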

In summary from Codex…

Debugging: the VAE problem

The initial implementation seemed feasible, but… I ran into an issue that I recognized as probably being in the VAE decoding. How’d I know that? Well, if you were fiddling around with Stable Diffusion 1.x when it was first coming out, you’d see this kind of output often when something was misconfigured. Also, you’d see people post similar problems on /r/StableDiffusion in those days.

For example, you can see some of the busted video here on imgur

the busted frame

Luckily, my gut was right: it turns out I had the VAE decoding in the wrong spot, order-of-operations-wise.
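
I won’t pretend the broken code looked exactly like this, but the flavor of the order-of-operations mistake (and the fix) is roughly the following, using the same hypothetical names as the sketch above:

# Broken ordering: decoding latents that haven't finished denoising yet
# gives you "VAE soup" frames like the ones in the imgur link above.
frames = pipe.vae.decode(latents)  # latents are still noisy here
for t in pipe.scheduler.timesteps(num_steps):
    latents = pipe.scheduler.step(pipe.transformer(latents, t, cond), t, latents)

# Fixed ordering: run the full denoising loop first, then decode exactly once.
for t in pipe.scheduler.timesteps(num_steps):
    latents = pipe.scheduler.step(pipe.transformer(latents, t, cond), t, latents)
frames = pipe.vae.decode(latents)  # clean latents in, clean frames out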

This brought me to the point where I could make the kinds of unhinged things that I like to make, such as the…

Bonus: Cowboy O-Hungee

You stayed for the entire blog article… and your reward is… this!? Sorry.

Demo Short: Watch on YouTube