feat: add --caption-images flag for VLM-based image description in RAG by bmahabirbu · Pull Request #2654 · containers/ramalama

bmahabirbu · 2026-04-23T01:54:11Z

Summary

Adds --caption-images [MODEL] flag to ramalama rag that optionally starts a third llama.cpp server running a general-purpose VLM (default: Gemma 4) to describe charts, diagrams, and photos found in documents
Image descriptions are injected as [Image: ...] text into the document before chunking, improving RAG retrieval for image-heavy content
Consolidates captioning into a single flag: --caption-images uses the default model, --caption-images hf://some/model uses a custom one
Refactors _cleanup_servers() to accept lists of servers/processes, supporting a variable number of containers
Guards against empty VLM API responses and rejects API transport models that cannot serve locally

Test plan

Run ramalama rag --caption-images ./docs/ myrag:latest with documents containing images and verify captions appear in chunked text
Run ramalama rag ./docs/ myrag:latest without --caption-images and verify no captioning server is started (backward compatible)
Run ramalama rag --caption-images hf://some/other-model ./docs/ myrag:latest to verify custom model override works
Verify cleanup correctly stops all containers including the caption server

🤖 Generated with Claude Code

coderabbitai · 2026-04-23T01:54:21Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 361521cd-4774-4a42-8984-4861580b3f9a

📥 Commits

Reviewing files that changed from the base of the PR and between a9792a9 and 8e90f57.

📒 Files selected for processing (6)

container-images/scripts/doc2rag
docs/ramalama-rag.1.md
ramalama/plugins/runtimes/inference/llama_cpp.py
ramalama/plugins/runtimes/inference/llama_cpp_commands.py
ramalama/plugins/runtimes/inference/rag/cli.py
ramalama/plugins/runtimes/inference/rag/handler.py

✅ Files skipped from review due to trivial changes (1)

ramalama/plugins/runtimes/inference/llama_cpp_commands.py

🚧 Files skipped from review as they are similar to previous changes (2)

ramalama/plugins/runtimes/inference/llama_cpp.py
container-images/scripts/doc2rag

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added image captioning functionality to RAG pipeline using Vision Language Models (VLMs) to generate text descriptions for images in documents prior to chunking.
- Introduced --caption-images CLI option to enable image captioning with configurable model selection.
Documentation
- Updated RAG command documentation with new --caption-images option details.

Walkthrough

Adds optional VLM-based image captioning to the RAG pipeline: a new ImageCaptioner captions PictureItem images via a configurable caption server and injects "[Image: ...]" text during document chunking; the RAG handler/CLI can provision and pass a caption server URL and manages its lifecycle.

Changes

Cohort / File(s)	Summary
Image captioning core `container-images/scripts/doc2rag`	Added `ImageCaptioner`; `chunk_documents(..., captioner=None)` and `_docling_doc_to_text(..., captioner=None)` now attempt to caption `PictureItem` images, append `"[Image: ...]"` captions, handle failures with warnings, and added `--caption-url` CLI handling.
RAG handler & server lifecycle `ramalama/plugins/runtimes/inference/rag/handler.py`	Supports optional caption llama.cpp server: reserves port, builds serve args, rejects API transports for captioning, ensures model availability, waits for `/health`, injects `caption_url` into args, and generalizes `_cleanup_servers` to accept lists of serve-args and processes.
RAG CLI surface `ramalama/plugins/runtimes/inference/rag/cli.py`, `docs/ramalama-rag.1.md`	Added `--caption-images` CLI flag (optional model value) and documented captioning option in the manpage.
Invocation & command plumbing `ramalama/plugins/runtimes/inference/llama_cpp_commands.py`	`_cmd_rag` now conditionally appends `--caption-url` to the doc2rag invocation when provided.
Cleanup callsite adjustment `ramalama/plugins/runtimes/inference/llama_cpp.py`	Updated `_cleanup_servers` calls to pass lists for serve-args and processes to match the generalized cleanup signature.

Sequence Diagram

sequenceDiagram
    participant User as User/CLI
    participant Handler as RAG Handler
    participant Doc2Rag as doc2rag (container)
    participant CapServer as Caption Server (llama.cpp)
    participant EmbedServer as Embed Server

    User->>Handler: invoke rag with --caption-images
    Handler->>CapServer: start caption server (model, port)
    CapServer-->>Handler: /health ready
    Handler->>EmbedServer: start embed server
    EmbedServer-->>Handler: ready
    Handler->>Doc2Rag: run RAG container with --caption-url
    Doc2Rag->>CapServer: POST image (base64 PNG + prompt)
    CapServer-->>Doc2Rag: caption text
    Doc2Rag->>Doc2Rag: insert "[Image: ...]" into document text
    Doc2Rag->>Doc2Rag: chunk documents (captioner used) and continue RAG flow
    Handler->>CapServer: cleanup
    Handler->>EmbedServer: cleanup

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

maxamillion
swarajpande5
jhjaggars
cgruver
engelmi

Poem

🐰 I nibble pixels into tiny lines,
I hum to servers, swap routes and signs.
A caption tucked where pictures hide,
Now text and images walk side by side.
Small hops — big help for RAG today!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main feature: adding a --caption-images flag for VLM-based image captioning in the RAG pipeline.
Description check	✅ Passed	The description is directly related to the changeset, detailing the feature implementation, refactoring, and test plan for VLM-based image captioning functionality.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces image captioning capabilities to the RAG pipeline by integrating a general-purpose VLM. It adds an ImageCaptioner class, new CLI arguments, and updates the orchestration logic to manage an optional third server for image descriptions. A review comment suggests adding safety checks when parsing the VLM API response to prevent potential index errors if the response is empty.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ramalama/plugins/runtimes/inference/rag/cli.py`:
- Around line 85-91: The CLI currently sets --caption-model default to a
non-None value but rag_handler only starts captioning when caption_images is
true, making --caption-model a silent no-op; change the parser.add_argument for
"caption_model" to default=None and update rag_handler to treat an explicit
caption_model as enabling captioning (i.e., if args.caption_images or
args.caption_model then set caption_model = args.caption_model or CAPTION_MODEL
and start the caption server), or alternatively validate and raise an error if
args.caption_model is provided without args.caption_images—ensure references to
parser.add_argument("caption_model"), rag_handler, args.caption_images,
args.caption_model, and CAPTION_MODEL are updated accordingly.

In `@ramalama/plugins/runtimes/inference/rag/handler.py`:
- Around line 78-81: The captioning path currently constructs caption_transport
= New(caption_model, caption_serve_args) and proceeds to call
ensure_model_exists/serve_nonblocking even when New(...) returns an APITransport
(a hosted provider) which cannot be served locally; mirror the embedding-side
behavior by detecting APITransport after creating caption_transport (check
type/name or isinstance(caption_transport, APITransport)) and raise/return an
error or skip local serving when that is the case; update the code around
caption_transport, caption_model, caption_serve_args and the New(...) call to
reject hosted API models for the caption-server path before calling
ensure_model_exists or serve_nonblocking.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9c7b7eb5-4c1e-4241-8f57-d2b061058463

📥 Commits

Reviewing files that changed from the base of the PR and between 8eb2135 and 6d10b4f.

📒 Files selected for processing (5)

container-images/scripts/doc2rag
ramalama/plugins/runtimes/inference/llama_cpp.py
ramalama/plugins/runtimes/inference/llama_cpp_commands.py
ramalama/plugins/runtimes/inference/rag/cli.py
ramalama/plugins/runtimes/inference/rag/handler.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@container-images/scripts/doc2rag`:
- Around line 250-259: In the PictureItem branch (check for PictureItem,
captioner, item.image) ensure a failed or empty caption does not drop the image:
after attempting description = captioner.caption(pil_image) and in the except
block, set a fallback_text = (item.text.strip() if getattr(item, "text", None)
else "[Image]") and append parts.append(f"\n[Image: {fallback_text}]\n") so that
when description is falsy or an exception occurs you preserve the item's text
(or a neutral placeholder) instead of omitting the image; update the code paths
that currently only append when description is truthy to use this fallback logic
and keep existing logger.warning in the exception handler.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 93892d0b-41d9-49f4-b5f0-f4e329d77af4

📥 Commits

Reviewing files that changed from the base of the PR and between 199c043 and a9792a9.

📒 Files selected for processing (6)

container-images/scripts/doc2rag
docs/ramalama-rag.1.md
ramalama/plugins/runtimes/inference/llama_cpp.py
ramalama/plugins/runtimes/inference/llama_cpp_commands.py
ramalama/plugins/runtimes/inference/rag/cli.py
ramalama/plugins/runtimes/inference/rag/handler.py

✅ Files skipped from review due to trivial changes (1)

ramalama/plugins/runtimes/inference/llama_cpp_commands.py

🚧 Files skipped from review as they are similar to previous changes (2)

ramalama/plugins/runtimes/inference/llama_cpp.py
ramalama/plugins/runtimes/inference/rag/handler.py

coderabbitai · 2026-04-23T02:11:12Z

+        elif isinstance(item, PictureItem) and captioner and item.image:
+            try:
+                pil_image = item.image.pil_image
+                if pil_image:
+                    description = captioner.caption(pil_image)
+                    if description:
+                        parts.append(f"\n[Image: {description.strip()}]\n")
+            except Exception as e:
+                logger.warning("Failed to caption image: %s", e)
        elif hasattr(item, "text") and item.text:


⚠️ Potential issue | 🟠 Major

Preserve figure content when captioning fails.

In the PictureItem path, a failed/empty caption currently drops the image content entirely for that item. Add a fallback to existing item.text (or a neutral placeholder) so chunking does not silently lose information.

Proposed fix

- elif isinstance(item, PictureItem) and captioner and item.image: - try: - pil_image = item.image.pil_image - if pil_image: - description = captioner.caption(pil_image) - if description: - parts.append(f"\n[Image: {description.strip()}]\n") - except Exception as e: - logger.warning("Failed to caption image: %s", e) + elif isinstance(item, PictureItem) and captioner and item.image: + description = None + try: + pil_image = item.image.pil_image + if pil_image: + description = captioner.caption(pil_image) + except Exception as e: + logger.warning("Failed to caption image: %s", e) + + if description: + parts.append(f"\n[Image: {description.strip()}]\n") + elif getattr(item, "text", None): + text = item.text.strip() + if text: + parts.append(text)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@container-images/scripts/doc2rag` around lines 250 - 259, In the PictureItem branch (check for PictureItem, captioner, item.image) ensure a failed or empty caption does not drop the image: after attempting description = captioner.caption(pil_image) and in the except block, set a fallback_text = (item.text.strip() if getattr(item, "text", None) else "[Image]") and append parts.append(f"\n[Image: {fallback_text}]\n") so that when description is falsy or an exception occurs you preserve the item's text (or a neutral placeholder) instead of omitting the image; update the code paths that currently only append when description is truthy to use this fallback logic and keep existing logger.warning in the exception handler.

@bmahabirbu WDYT?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Add optional image captioning during RAG document processing using a general-purpose VLM (default: Gemma 4). When --caption-images is enabled, a third llama.cpp server describes charts, diagrams, and photos found in documents, injecting descriptions into the text before chunking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Brian <bmahabir@bu.edu>

mikebonnet · 2026-04-23T17:13:22Z

This looks really cool! Could you add some unit and e2e (slow) tests for it?

rhatdan · 2026-05-06T15:23:28Z

@bmahabirbu Needs a rebase and then we could get this in.

bmahabirbu requested review from cgruver, engelmi, jhjaggars, maxamillion, mikebonnet, olliewalsh, rhatdan and swarajpande5 as code owners April 23, 2026 01:54

bmahabirbu temporarily deployed to macos-installer April 23, 2026 01:54 — with GitHub Actions Inactive

bmahabirbu force-pushed the caption-images branch from 6d10b4f to 199c043 Compare April 23, 2026 01:57

bmahabirbu temporarily deployed to macos-installer April 23, 2026 01:57 — with GitHub Actions Inactive

gemini-code-assist Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread container-images/scripts/doc2rag Outdated

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread ramalama/plugins/runtimes/inference/rag/cli.py Outdated

Comment thread ramalama/plugins/runtimes/inference/rag/handler.py

bmahabirbu force-pushed the caption-images branch from 199c043 to a9792a9 Compare April 23, 2026 02:04

bmahabirbu temporarily deployed to macos-installer April 23, 2026 02:05 — with GitHub Actions Inactive

bmahabirbu force-pushed the caption-images branch from a9792a9 to 7cb6092 Compare April 23, 2026 02:09

bmahabirbu temporarily deployed to macos-installer April 23, 2026 02:09 — with GitHub Actions Inactive

bmahabirbu force-pushed the caption-images branch from 7cb6092 to 26bbdbe Compare April 23, 2026 02:10

bmahabirbu temporarily deployed to macos-installer April 23, 2026 02:10 — with GitHub Actions Inactive

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

bmahabirbu force-pushed the caption-images branch from 26bbdbe to 8e90f57 Compare April 23, 2026 02:19

bmahabirbu temporarily deployed to macos-installer April 23, 2026 02:19 — with GitHub Actions Inactive

Conversation

bmahabirbu commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

coderabbitai Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rhatdan Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

mikebonnet commented Apr 23, 2026

Uh oh!

rhatdan commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bmahabirbu commented Apr 23, 2026 •

edited

Loading

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading

coderabbitai Bot Apr 23, 2026 •

edited

Loading