feat: add --caption-images flag for VLM-based image description in RAG#2654
feat: add --caption-images flag for VLM-based image description in RAG#2654bmahabirbu wants to merge 1 commit intocontainers:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds optional VLM-based image captioning to the RAG pipeline: a new ImageCaptioner captions PictureItem images via a configurable caption server and injects "[Image: ...]" text during document chunking; the RAG handler/CLI can provision and pass a caption server URL and manages its lifecycle. Changes
Sequence DiagramsequenceDiagram
participant User as User/CLI
participant Handler as RAG Handler
participant Doc2Rag as doc2rag (container)
participant CapServer as Caption Server (llama.cpp)
participant EmbedServer as Embed Server
User->>Handler: invoke rag with --caption-images
Handler->>CapServer: start caption server (model, port)
CapServer-->>Handler: /health ready
Handler->>EmbedServer: start embed server
EmbedServer-->>Handler: ready
Handler->>Doc2Rag: run RAG container with --caption-url
Doc2Rag->>CapServer: POST image (base64 PNG + prompt)
CapServer-->>Doc2Rag: caption text
Doc2Rag->>Doc2Rag: insert "[Image: ...]" into document text
Doc2Rag->>Doc2Rag: chunk documents (captioner used) and continue RAG flow
Handler->>CapServer: cleanup
Handler->>EmbedServer: cleanup
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
6d10b4f to
199c043
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces image captioning capabilities to the RAG pipeline by integrating a general-purpose VLM. It adds an ImageCaptioner class, new CLI arguments, and updates the orchestration logic to manage an optional third server for image descriptions. A review comment suggests adding safety checks when parsing the VLM API response to prevent potential index errors if the response is empty.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ramalama/plugins/runtimes/inference/rag/cli.py`:
- Around line 85-91: The CLI currently sets --caption-model default to a
non-None value but rag_handler only starts captioning when caption_images is
true, making --caption-model a silent no-op; change the parser.add_argument for
"caption_model" to default=None and update rag_handler to treat an explicit
caption_model as enabling captioning (i.e., if args.caption_images or
args.caption_model then set caption_model = args.caption_model or CAPTION_MODEL
and start the caption server), or alternatively validate and raise an error if
args.caption_model is provided without args.caption_images—ensure references to
parser.add_argument("caption_model"), rag_handler, args.caption_images,
args.caption_model, and CAPTION_MODEL are updated accordingly.
In `@ramalama/plugins/runtimes/inference/rag/handler.py`:
- Around line 78-81: The captioning path currently constructs caption_transport
= New(caption_model, caption_serve_args) and proceeds to call
ensure_model_exists/serve_nonblocking even when New(...) returns an APITransport
(a hosted provider) which cannot be served locally; mirror the embedding-side
behavior by detecting APITransport after creating caption_transport (check
type/name or isinstance(caption_transport, APITransport)) and raise/return an
error or skip local serving when that is the case; update the code around
caption_transport, caption_model, caption_serve_args and the New(...) call to
reject hosted API models for the caption-server path before calling
ensure_model_exists or serve_nonblocking.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 9c7b7eb5-4c1e-4241-8f57-d2b061058463
📒 Files selected for processing (5)
container-images/scripts/doc2ragramalama/plugins/runtimes/inference/llama_cpp.pyramalama/plugins/runtimes/inference/llama_cpp_commands.pyramalama/plugins/runtimes/inference/rag/cli.pyramalama/plugins/runtimes/inference/rag/handler.py
199c043 to
a9792a9
Compare
a9792a9 to
7cb6092
Compare
7cb6092 to
26bbdbe
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@container-images/scripts/doc2rag`:
- Around line 250-259: In the PictureItem branch (check for PictureItem,
captioner, item.image) ensure a failed or empty caption does not drop the image:
after attempting description = captioner.caption(pil_image) and in the except
block, set a fallback_text = (item.text.strip() if getattr(item, "text", None)
else "[Image]") and append parts.append(f"\n[Image: {fallback_text}]\n") so that
when description is falsy or an exception occurs you preserve the item's text
(or a neutral placeholder) instead of omitting the image; update the code paths
that currently only append when description is truthy to use this fallback logic
and keep existing logger.warning in the exception handler.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 93892d0b-41d9-49f4-b5f0-f4e329d77af4
📒 Files selected for processing (6)
container-images/scripts/doc2ragdocs/ramalama-rag.1.mdramalama/plugins/runtimes/inference/llama_cpp.pyramalama/plugins/runtimes/inference/llama_cpp_commands.pyramalama/plugins/runtimes/inference/rag/cli.pyramalama/plugins/runtimes/inference/rag/handler.py
✅ Files skipped from review due to trivial changes (1)
- ramalama/plugins/runtimes/inference/llama_cpp_commands.py
🚧 Files skipped from review as they are similar to previous changes (2)
- ramalama/plugins/runtimes/inference/llama_cpp.py
- ramalama/plugins/runtimes/inference/rag/handler.py
| elif isinstance(item, PictureItem) and captioner and item.image: | ||
| try: | ||
| pil_image = item.image.pil_image | ||
| if pil_image: | ||
| description = captioner.caption(pil_image) | ||
| if description: | ||
| parts.append(f"\n[Image: {description.strip()}]\n") | ||
| except Exception as e: | ||
| logger.warning("Failed to caption image: %s", e) | ||
| elif hasattr(item, "text") and item.text: |
There was a problem hiding this comment.
Preserve figure content when captioning fails.
In the PictureItem path, a failed/empty caption currently drops the image content entirely for that item. Add a fallback to existing item.text (or a neutral placeholder) so chunking does not silently lose information.
Proposed fix
- elif isinstance(item, PictureItem) and captioner and item.image:
- try:
- pil_image = item.image.pil_image
- if pil_image:
- description = captioner.caption(pil_image)
- if description:
- parts.append(f"\n[Image: {description.strip()}]\n")
- except Exception as e:
- logger.warning("Failed to caption image: %s", e)
+ elif isinstance(item, PictureItem) and captioner and item.image:
+ description = None
+ try:
+ pil_image = item.image.pil_image
+ if pil_image:
+ description = captioner.caption(pil_image)
+ except Exception as e:
+ logger.warning("Failed to caption image: %s", e)
+
+ if description:
+ parts.append(f"\n[Image: {description.strip()}]\n")
+ elif getattr(item, "text", None):
+ text = item.text.strip()
+ if text:
+ parts.append(text)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@container-images/scripts/doc2rag` around lines 250 - 259, In the PictureItem
branch (check for PictureItem, captioner, item.image) ensure a failed or empty
caption does not drop the image: after attempting description =
captioner.caption(pil_image) and in the except block, set a fallback_text =
(item.text.strip() if getattr(item, "text", None) else "[Image]") and append
parts.append(f"\n[Image: {fallback_text}]\n") so that when description is falsy
or an exception occurs you preserve the item's text (or a neutral placeholder)
instead of omitting the image; update the code paths that currently only append
when description is truthy to use this fallback logic and keep existing
logger.warning in the exception handler.
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Add optional image captioning during RAG document processing using a general-purpose VLM (default: Gemma 4). When --caption-images is enabled, a third llama.cpp server describes charts, diagrams, and photos found in documents, injecting descriptions into the text before chunking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Brian <bmahabir@bu.edu>
26bbdbe to
8e90f57
Compare
|
This looks really cool! Could you add some unit and e2e (slow) tests for it? |
|
@bmahabirbu Needs a rebase and then we could get this in. |
Summary
--caption-images [MODEL]flag toramalama ragthat optionally starts a third llama.cpp server running a general-purpose VLM (default: Gemma 4) to describe charts, diagrams, and photos found in documents[Image: ...]text into the document before chunking, improving RAG retrieval for image-heavy content--caption-imagesuses the default model,--caption-images hf://some/modeluses a custom one_cleanup_servers()to accept lists of servers/processes, supporting a variable number of containersTest plan
ramalama rag --caption-images ./docs/ myrag:latestwith documents containing images and verify captions appear in chunked textramalama rag ./docs/ myrag:latestwithout--caption-imagesand verify no captioning server is started (backward compatible)ramalama rag --caption-images hf://some/other-model ./docs/ myrag:latestto verify custom model override works🤖 Generated with Claude Code