Skip to content

feat: add --caption-images flag for VLM-based image description in RAG#2654

Open
bmahabirbu wants to merge 1 commit intocontainers:mainfrom
bmahabirbu:caption-images
Open

feat: add --caption-images flag for VLM-based image description in RAG#2654
bmahabirbu wants to merge 1 commit intocontainers:mainfrom
bmahabirbu:caption-images

Conversation

@bmahabirbu
Copy link
Copy Markdown
Collaborator

@bmahabirbu bmahabirbu commented Apr 23, 2026

Summary

  • Adds --caption-images [MODEL] flag to ramalama rag that optionally starts a third llama.cpp server running a general-purpose VLM (default: Gemma 4) to describe charts, diagrams, and photos found in documents
  • Image descriptions are injected as [Image: ...] text into the document before chunking, improving RAG retrieval for image-heavy content
  • Consolidates captioning into a single flag: --caption-images uses the default model, --caption-images hf://some/model uses a custom one
  • Refactors _cleanup_servers() to accept lists of servers/processes, supporting a variable number of containers
  • Guards against empty VLM API responses and rejects API transport models that cannot serve locally

Test plan

  • Run ramalama rag --caption-images ./docs/ myrag:latest with documents containing images and verify captions appear in chunked text
  • Run ramalama rag ./docs/ myrag:latest without --caption-images and verify no captioning server is started (backward compatible)
  • Run ramalama rag --caption-images hf://some/other-model ./docs/ myrag:latest to verify custom model override works
  • Verify cleanup correctly stops all containers including the caption server

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 361521cd-4774-4a42-8984-4861580b3f9a

📥 Commits

Reviewing files that changed from the base of the PR and between a9792a9 and 8e90f57.

📒 Files selected for processing (6)
  • container-images/scripts/doc2rag
  • docs/ramalama-rag.1.md
  • ramalama/plugins/runtimes/inference/llama_cpp.py
  • ramalama/plugins/runtimes/inference/llama_cpp_commands.py
  • ramalama/plugins/runtimes/inference/rag/cli.py
  • ramalama/plugins/runtimes/inference/rag/handler.py
✅ Files skipped from review due to trivial changes (1)
  • ramalama/plugins/runtimes/inference/llama_cpp_commands.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • ramalama/plugins/runtimes/inference/llama_cpp.py
  • container-images/scripts/doc2rag

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added image captioning functionality to RAG pipeline using Vision Language Models (VLMs) to generate text descriptions for images in documents prior to chunking.
    • Introduced --caption-images CLI option to enable image captioning with configurable model selection.
  • Documentation

    • Updated RAG command documentation with new --caption-images option details.

Walkthrough

Adds optional VLM-based image captioning to the RAG pipeline: a new ImageCaptioner captions PictureItem images via a configurable caption server and injects "[Image: ...]" text during document chunking; the RAG handler/CLI can provision and pass a caption server URL and manages its lifecycle.

Changes

Cohort / File(s) Summary
Image captioning core
container-images/scripts/doc2rag
Added ImageCaptioner; chunk_documents(..., captioner=None) and _docling_doc_to_text(..., captioner=None) now attempt to caption PictureItem images, append "[Image: ...]" captions, handle failures with warnings, and added --caption-url CLI handling.
RAG handler & server lifecycle
ramalama/plugins/runtimes/inference/rag/handler.py
Supports optional caption llama.cpp server: reserves port, builds serve args, rejects API transports for captioning, ensures model availability, waits for /health, injects caption_url into args, and generalizes _cleanup_servers to accept lists of serve-args and processes.
RAG CLI surface
ramalama/plugins/runtimes/inference/rag/cli.py, docs/ramalama-rag.1.md
Added --caption-images CLI flag (optional model value) and documented captioning option in the manpage.
Invocation & command plumbing
ramalama/plugins/runtimes/inference/llama_cpp_commands.py
_cmd_rag now conditionally appends --caption-url to the doc2rag invocation when provided.
Cleanup callsite adjustment
ramalama/plugins/runtimes/inference/llama_cpp.py
Updated _cleanup_servers calls to pass lists for serve-args and processes to match the generalized cleanup signature.

Sequence Diagram

sequenceDiagram
    participant User as User/CLI
    participant Handler as RAG Handler
    participant Doc2Rag as doc2rag (container)
    participant CapServer as Caption Server (llama.cpp)
    participant EmbedServer as Embed Server

    User->>Handler: invoke rag with --caption-images
    Handler->>CapServer: start caption server (model, port)
    CapServer-->>Handler: /health ready
    Handler->>EmbedServer: start embed server
    EmbedServer-->>Handler: ready
    Handler->>Doc2Rag: run RAG container with --caption-url
    Doc2Rag->>CapServer: POST image (base64 PNG + prompt)
    CapServer-->>Doc2Rag: caption text
    Doc2Rag->>Doc2Rag: insert "[Image: ...]" into document text
    Doc2Rag->>Doc2Rag: chunk documents (captioner used) and continue RAG flow
    Handler->>CapServer: cleanup
    Handler->>EmbedServer: cleanup
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • maxamillion
  • swarajpande5
  • jhjaggars
  • cgruver
  • engelmi

Poem

🐰 I nibble pixels into tiny lines,
I hum to servers, swap routes and signs.
A caption tucked where pictures hide,
Now text and images walk side by side.
Small hops — big help for RAG today!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main feature: adding a --caption-images flag for VLM-based image captioning in the RAG pipeline.
Description check ✅ Passed The description is directly related to the changeset, detailing the feature implementation, refactoring, and test plan for VLM-based image captioning functionality.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces image captioning capabilities to the RAG pipeline by integrating a general-purpose VLM. It adds an ImageCaptioner class, new CLI arguments, and updates the orchestration logic to manage an optional third server for image descriptions. A review comment suggests adding safety checks when parsing the VLM API response to prevent potential index errors if the response is empty.

Comment thread container-images/scripts/doc2rag Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ramalama/plugins/runtimes/inference/rag/cli.py`:
- Around line 85-91: The CLI currently sets --caption-model default to a
non-None value but rag_handler only starts captioning when caption_images is
true, making --caption-model a silent no-op; change the parser.add_argument for
"caption_model" to default=None and update rag_handler to treat an explicit
caption_model as enabling captioning (i.e., if args.caption_images or
args.caption_model then set caption_model = args.caption_model or CAPTION_MODEL
and start the caption server), or alternatively validate and raise an error if
args.caption_model is provided without args.caption_images—ensure references to
parser.add_argument("caption_model"), rag_handler, args.caption_images,
args.caption_model, and CAPTION_MODEL are updated accordingly.

In `@ramalama/plugins/runtimes/inference/rag/handler.py`:
- Around line 78-81: The captioning path currently constructs caption_transport
= New(caption_model, caption_serve_args) and proceeds to call
ensure_model_exists/serve_nonblocking even when New(...) returns an APITransport
(a hosted provider) which cannot be served locally; mirror the embedding-side
behavior by detecting APITransport after creating caption_transport (check
type/name or isinstance(caption_transport, APITransport)) and raise/return an
error or skip local serving when that is the case; update the code around
caption_transport, caption_model, caption_serve_args and the New(...) call to
reject hosted API models for the caption-server path before calling
ensure_model_exists or serve_nonblocking.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9c7b7eb5-4c1e-4241-8f57-d2b061058463

📥 Commits

Reviewing files that changed from the base of the PR and between 8eb2135 and 6d10b4f.

📒 Files selected for processing (5)
  • container-images/scripts/doc2rag
  • ramalama/plugins/runtimes/inference/llama_cpp.py
  • ramalama/plugins/runtimes/inference/llama_cpp_commands.py
  • ramalama/plugins/runtimes/inference/rag/cli.py
  • ramalama/plugins/runtimes/inference/rag/handler.py

Comment thread ramalama/plugins/runtimes/inference/rag/cli.py Outdated
Comment thread ramalama/plugins/runtimes/inference/rag/handler.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@container-images/scripts/doc2rag`:
- Around line 250-259: In the PictureItem branch (check for PictureItem,
captioner, item.image) ensure a failed or empty caption does not drop the image:
after attempting description = captioner.caption(pil_image) and in the except
block, set a fallback_text = (item.text.strip() if getattr(item, "text", None)
else "[Image]") and append parts.append(f"\n[Image: {fallback_text}]\n") so that
when description is falsy or an exception occurs you preserve the item's text
(or a neutral placeholder) instead of omitting the image; update the code paths
that currently only append when description is truthy to use this fallback logic
and keep existing logger.warning in the exception handler.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 93892d0b-41d9-49f4-b5f0-f4e329d77af4

📥 Commits

Reviewing files that changed from the base of the PR and between 199c043 and a9792a9.

📒 Files selected for processing (6)
  • container-images/scripts/doc2rag
  • docs/ramalama-rag.1.md
  • ramalama/plugins/runtimes/inference/llama_cpp.py
  • ramalama/plugins/runtimes/inference/llama_cpp_commands.py
  • ramalama/plugins/runtimes/inference/rag/cli.py
  • ramalama/plugins/runtimes/inference/rag/handler.py
✅ Files skipped from review due to trivial changes (1)
  • ramalama/plugins/runtimes/inference/llama_cpp_commands.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • ramalama/plugins/runtimes/inference/llama_cpp.py
  • ramalama/plugins/runtimes/inference/rag/handler.py

Comment on lines +250 to 259
elif isinstance(item, PictureItem) and captioner and item.image:
try:
pil_image = item.image.pil_image
if pil_image:
description = captioner.caption(pil_image)
if description:
parts.append(f"\n[Image: {description.strip()}]\n")
except Exception as e:
logger.warning("Failed to caption image: %s", e)
elif hasattr(item, "text") and item.text:
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Preserve figure content when captioning fails.

In the PictureItem path, a failed/empty caption currently drops the image content entirely for that item. Add a fallback to existing item.text (or a neutral placeholder) so chunking does not silently lose information.

Proposed fix
-        elif isinstance(item, PictureItem) and captioner and item.image:
-            try:
-                pil_image = item.image.pil_image
-                if pil_image:
-                    description = captioner.caption(pil_image)
-                    if description:
-                        parts.append(f"\n[Image: {description.strip()}]\n")
-            except Exception as e:
-                logger.warning("Failed to caption image: %s", e)
+        elif isinstance(item, PictureItem) and captioner and item.image:
+            description = None
+            try:
+                pil_image = item.image.pil_image
+                if pil_image:
+                    description = captioner.caption(pil_image)
+            except Exception as e:
+                logger.warning("Failed to caption image: %s", e)
+
+            if description:
+                parts.append(f"\n[Image: {description.strip()}]\n")
+            elif getattr(item, "text", None):
+                text = item.text.strip()
+                if text:
+                    parts.append(text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@container-images/scripts/doc2rag` around lines 250 - 259, In the PictureItem
branch (check for PictureItem, captioner, item.image) ensure a failed or empty
caption does not drop the image: after attempting description =
captioner.caption(pil_image) and in the except block, set a fallback_text =
(item.text.strip() if getattr(item, "text", None) else "[Image]") and append
parts.append(f"\n[Image: {fallback_text}]\n") so that when description is falsy
or an exception occurs you preserve the item's text (or a neutral placeholder)
instead of omitting the image; update the code paths that currently only append
when description is truthy to use this fallback logic and keep existing
logger.warning in the exception handler.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bmahabirbu WDYT?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Add optional image captioning during RAG document processing using a
general-purpose VLM (default: Gemma 4). When --caption-images is enabled,
a third llama.cpp server describes charts, diagrams, and photos found in
documents, injecting descriptions into the text before chunking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Brian <bmahabir@bu.edu>
@mikebonnet
Copy link
Copy Markdown
Collaborator

This looks really cool! Could you add some unit and e2e (slow) tests for it?

@rhatdan
Copy link
Copy Markdown
Member

rhatdan commented May 6, 2026

@bmahabirbu Needs a rebase and then we could get this in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants