Using LLaVA for captioning Stable Diffusion fine tuning datasets

In this article, we’re going to use LLaVA (running under ollama) to caption images for a Stable Diffusion training dataset. Well, a fine-tuning dataset in my case; I’ve usually been baking LoRAs with the Kohya SS GUI.

Something I’ve been hearing about is that people are using LLaVA to caption their datasets for training Stable Diffusion LoRAs (low rank adaptations, a kind of fine tuning of a model). And I was like: this would be great. I have a few big datasets, and I have my own ways of adding some info from metadata I might have, but I’d love to get the captions more detailed, too.

From the LLaVA project page:

LLaVA represents a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking spirits of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.

What that really means in plainer terms is: you can prompt it with text and images, and get text output generated by the LLM that’s relevant to the imagery you provided.

Previously, I’ve been using BLIP (huggingface) from within the captioning tools of Kohya SS (my favorite training app of the moment), and then I’d sometimes munge those captions with sed and call it a day.

However, this method using LLaVA really intrigues me, so I wanted to get it set up.

Granted, I did think about whipping up a script to do this using GPT-4’s multimodal API, and I will probably try that at some point. But! It’s not nearly as fun, nor as rad, as having your own local multimodal setup. I had also put this project in my notes for that purpose: github.com/jiayev/GPT4V-Image-Captioner. Not to mention, since you run it on your own gear, you get to make your own rules. Naturally, I love cloud computing, but I’m often reminded of Stallman’s “Can You Trust Your Computer?” essay in the age of centralized applications.

I will provide my own script to make API queries against ollama, but I’ll also link to a few other tools if you’d rather start with something more full fledged. In my case, I just wanted to be able to do my own prompt engineering and have my own method to keep munging my captions. It’s not a lot of code, so it’s easy to wrangle yourself.

Prerequisites

Getting LLaVA running with ollama

First, get ollama installed; it’s a nicely packaged, opinionated way to get LLMs running locally, which I quite like. If you wanna learn more about LLaVA on ollama, you can also check out this great YouTube video by Learn Data with Mark.

Then I went and pulled this model (you can pull it manually, or it should pull when you do ollama run ...):

https://ollama.com/library/llava:13b
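If you want to grab it ahead of time, pulling it manually is just:

$ ollama pull llava:13b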

Ok, I go ahead and start ollama serve…

$ screen -S ollamaserve
$ ollama serve

I’m starting with the 13b param model @ https://ollama.com/library/llava:13b

Then I kick it off with…

$ ollama run llava:13b

And let’s give it a test drive. I use an example image from a dataset I have going for Adirondack guideboats:

>>> please describe this image /path-to/dataset/guideboat/.../example.jpg
Added image '/path-to/dataset/guideboat/.../example.jpg'
 The image you've provided appears to be an old black and white photograph. It shows a group of people in boats on calm waters, likely a river or lake. There are several individuals visible, with one person who seems to be actively
rowing or maneuvering the boat. In the background, there is land with trees and vegetation, indicating a natural setting. The image has a vintage appearance due to its monochrome color scheme and grainy texture, which suggests it 
could be from an earlier time period, possibly mid-20th century or earlier, judging by the style of clothing and equipment.

Looks like a pure win! I didn’t even have to look at the image to know it’s right; that description matches most of the images in this dataset.

Captioning images: Some existing tools

I saw on Reddit that taggui now supports LLaVA captioning, so you might want to check it out @ https://github.com/jhc13/taggui/

Captioning images

So it seems like there are two ways to do this…

  1. From the CLI I can provide a path to an image
  2. Via the API I can provide a base64 encoded image

We’re going to use this library…

https://pypi.org/project/ollama/
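Just to get a feel for it, a single captioning call with that library looks something like this (going off the library’s README; the prompt and image path here are placeholders, and you hand it a local file path and it handles reading and encoding the image for you):

import ollama

response = ollama.chat(
    model='llava:13b',
    messages=[{
        'role': 'user',
        'content': 'Describe this image in one detailed caption.',
        'images': ['/path/to/dataset/example.jpg'],  # local path; the library encodes it for the API
    }],
)

print(response['message']['content'])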

And take a look at the API docs…

https://github.com/ollama/ollama/blob/main/docs/api.md
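And if you’d rather hit the REST API directly (which is all the library wraps anyway), you base64 encode the image yourself and POST it to /api/generate. A minimal sketch, assuming ollama is serving on its default port 11434:

import base64
import requests

# Read the image and base64 encode it, as the API expects.
with open('/path/to/dataset/example.jpg', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode('utf-8')

resp = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llava:13b',
    'prompt': 'Describe this image in one detailed caption.',
    'images': [image_b64],  # list of base64-encoded images
    'stream': False,        # one JSON response instead of a stream
})

print(resp.json()['response'])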

Later, I think we’re going to create a Modelfile (see https://github.com/ollama/ollama/blob/main/docs/modelfile.md), where we can better define an overall system prompt.
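For example, a Modelfile for this could be as simple as the following; the system prompt text is just a placeholder to swap for your own captioning instructions, and llava-captioner is whatever name you want to give it:

FROM llava:13b
SYSTEM """You are captioning images for a Stable Diffusion training dataset. Reply with one detailed caption and nothing else."""

Then you’d build and run it with:

$ ollama create llava-captioner -f ./Modelfile
$ ollama run llava-captioner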

In the meantime, we’ll just confidently emulate it.

For now though, I think I only need a little bit of context for my system prompt, so we’ll just jam it right in.
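“Jamming it in” just means sending those instructions along with every request, either prefixed onto the user prompt or as a system-role message. A rough sketch of the latter (the prompt wording and path are just placeholders):

import ollama

# Our "jammed in" captioning instructions, sent as a system message with each request.
pre_prompt = ('You are captioning images for a Stable Diffusion training dataset. '
              'Reply with one detailed caption and nothing else.')

response = ollama.chat(
    model='llava:13b',
    messages=[
        {'role': 'system', 'content': pre_prompt},
        {'role': 'user', 'content': 'Caption this image.', 'images': ['/path/to/image.jpg']},
    ],
)

print(response['message']['content'])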

I fed GPT-4 some info about the library and the API and had it whip me up something quick. This script doesn’t need to be intense: just an intro prompt, and then a way to cycle through images and output a caption.

Using my dougbtv/ollama-llava-captioner script

I put my quick scripts up on github @ dougbtv/ollama-llava-captioner

First, you’ve got to make sure you pip install ollama; it’s the only dependency…

Then, you just run it with a path to your folder with images…

doug@stimsonmt:~/ai-ml/llava-captioner$ python llava-caption-ollama.py /path/to/dataset/guideboat/images/15_guideboat/
[...]
Processing 005C3BDB-6751-4E98-9CEE-352236552770.jpg (1/1260)...
Generated Caption:  Vintage black and white photograph of two people in a small boat on a wavy lake. They appear to be engaged in fishing or enjoying the tranquil water. The background shows a hazy, serene skyline with houses along the shoreline under a cloudy sky.
Completed 0.08% in 2.12s, estimated 2663.50s left.
Processing 00764EBA-A1C4-4687-B45A-226973315006.jpg (2/1260)...
Generated Caption:  An old, vintage black and white photograph depicting an early aviation scene. A seaplane is gliding over a serene lake nestled among pine trees and hills in the background, capturing a moment of adventure and exploration amidst nature's tranquility.
Completed 0.16% in 4.16s, estimated 2615.48s left.

It uses a kind of “pre-prompt” to set up the instructions for captioning. I think you should probably tune this: start with the prompt.txt in the root dir of the repo and modify it yourself, then either change it in place or run the script with a path to where you saved your modified (or new!) prompt.

$ python llava-caption-ollama.py /path/to/dataset/guideboat/images/15_guideboat/ /path/to/my.prompt.txt

In both cases it will save a .txt file in the same folder as your images, with the same base file name (i.e. the file name before the extension).
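If you’d rather roll your own, the whole flow is small enough to sketch. This is a simplified, hypothetical version and not the exact code from the repo, but it’s the shape of it: read the pre-prompt, walk the image folder, ask LLaVA for a caption, and write a .txt next to each image.

import os
import sys
import ollama

image_dir = sys.argv[1]
prompt_path = sys.argv[2] if len(sys.argv) > 2 else 'prompt.txt'

with open(prompt_path) as f:
    pre_prompt = f.read().strip()

images = sorted(name for name in os.listdir(image_dir)
                if name.lower().endswith(('.jpg', '.jpeg', '.png', '.webp')))

for i, name in enumerate(images, start=1):
    image_path = os.path.join(image_dir, name)

    response = ollama.chat(
        model='llava:13b',
        messages=[{
            'role': 'user',
            'content': pre_prompt,
            'images': [image_path],
        }],
    )
    caption = response['message']['content'].strip()

    # Same base file name as the image, with a .txt extension, in the same folder.
    with open(os.path.splitext(image_path)[0] + '.txt', 'w') as f:
        f.write(caption)

    print(f'({i}/{len(images)}) {name}: {caption}')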

In conclusion…

Reading the captions as a human, this seems more accurate than BLIP captioning. I haven’t put together an organized before/after comparison on the datasets I’ve tried this with, but my intuition says it works quite a bit better. I’ll try to come back with some results in the future, but until then…

Happy captioning!