
Add VLM inference infrastructure: engine, protocol, and CLI support #65

Draft
stikves wants to merge 1 commit into apple:main from stikves:sukru/vlm-infra

@stikves stikves commented Jun 25, 2026

Contributor

Runtime:

  • MultimodalInferenceEngine protocol with encodeImage() and generate()
  • CoreAISequentialVLMEngine: vision encoder + projector + embed_tokens + LLM decoder with scatter-merge of image embeddings at placeholder positions
  • EmbeddedInput type wrapping NDArray embeddings with position metadata
  • VisionConfig in LanguageConfig for image_size, patch_size, token count/id
  • LanguageBundle parses top-level "vision" block from metadata.json
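
The scatter-merge step described above can be sketched roughly as follows. This is an illustration only, not the engine code from this PR: it uses plain Python lists in place of NDArray, and the function name `scatter_merge` and its parameters are hypothetical. It assumes the prompt contains exactly one placeholder token per image embedding row.

```python
from typing import List, Sequence

Vector = List[float]


def scatter_merge(
    token_ids: Sequence[int],
    text_embeds: Sequence[Vector],
    image_embeds: Sequence[Vector],
    image_token_id: int,
) -> List[Vector]:
    """Overwrite the embedding at each placeholder position with the next image embedding.

    Hypothetical helper: names and signature are assumptions made for illustration.
    """
    if len(token_ids) != len(text_embeds):
        raise ValueError("token_ids and text_embeds must have the same length")

    # Positions in the prompt that hold the image placeholder token.
    positions = [i for i, tok in enumerate(token_ids) if tok == image_token_id]
    if len(positions) != len(image_embeds):
        raise ValueError(
            f"expected {len(positions)} image embeddings, got {len(image_embeds)}"
        )

    # Copy the text embeddings, then scatter image rows into the placeholder slots in order.
    merged = [list(row) for row in text_embeds]
    for pos, img in zip(positions, image_embeds):
        merged[pos] = list(img)
    return merged


# Token 99 plays the role of the image placeholder in this toy example.
merged = scatter_merge(
    token_ids=[1, 99, 99, 2],
    text_embeds=[[0.0, 0.0], [0.1, 0.1], [0.1, 0.1], [0.2, 0.2]],
    image_embeds=[[1.0, 1.0], [2.0, 2.0]],
    image_token_id=99,
)
print(merged)
```

The merged sequence keeps the original length, so the decoder sees image embeddings at the same positions where the tokenizer emitted placeholders.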

CLI (llm-runner):

  • --image flag routes through VLM engine when bundle kind is .vlm
  • Chat template detection with generic fallback for prompt construction
  • Accumulated token decode for correct spacing
  • Stop sequence support in VLM path
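
To illustrate why accumulated decoding matters for spacing, here is a minimal sketch with a toy SentencePiece-style vocabulary. Nothing in it is taken from the PR: `TOY_VOCAB`, `toy_decode`, and `stream_text` are hypothetical names, and the stop handling is simplified (it does not hold back text that might be the start of a stop sequence). It also assumes each new decode extends the previously decoded text.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# Toy vocabulary: "▁" marks a leading space, as in SentencePiece-style tokenizers.
TOY_VOCAB = {0: "▁Hello", 1: "▁world", 2: "!", 3: "END", 4: "▁ignored"}


def toy_decode(ids: Sequence[int]) -> str:
    """Join pieces, turn "▁" into spaces, and drop the single leading space."""
    text = "".join(TOY_VOCAB[i] for i in ids).replace("▁", " ")
    return text[1:] if text.startswith(" ") else text


def stream_text(
    token_ids: Iterable[int],
    decode: Callable[[Sequence[int]], str],
    stop_sequences: Sequence[str] = (),
) -> Iterator[str]:
    """Decode the whole token history each step and yield only the new suffix."""
    seen: List[int] = []
    emitted = ""
    for tid in token_ids:
        seen.append(tid)
        full = decode(seen)

        # Stop at the earliest stop sequence found in the accumulated text.
        hits = [full.find(s) for s in stop_sequences if s]
        hits = [h for h in hits if h != -1]
        if hits:
            cut = min(hits)
            if cut > len(emitted):
                yield full[len(emitted):cut]
            return

        delta = full[len(emitted):]
        if delta:
            yield delta
        emitted = full


tokens = [0, 1, 2, 3, 4]
accumulated = "".join(stream_text(tokens, toy_decode, stop_sequences=["END"]))
per_token = "".join(toy_decode([t]) for t in tokens[:3])
print(repr(accumulated))  # 'Hello world!'
print(repr(per_token))    # 'Helloworld!'
```

Decoding each token on its own strips the leading space from every piece, while decoding the accumulated sequence and emitting only the delta preserves the spaces between words.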

Supports any VLM that exports 3 components (vision.aimodel, embed.aimodel, model.aimodel) with a vision config block in metadata.json. Model-family-specific export code lives in internal/python.
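
As a rough picture of what the vision block in metadata.json might look like and how it could be read, here is a small sketch. The exact schema is not shown in this description: `image_size` and `patch_size` are named above, but the key names `num_image_tokens` and `image_token_id`, the `kind` field, and all values are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VisionConfig:
    image_size: int
    patch_size: int
    num_image_tokens: int  # assumed key name for the "token count"
    image_token_id: int    # assumed key name for the "token id"


def parse_vision_config(metadata_json: str) -> Optional[VisionConfig]:
    """Return the top-level "vision" block as a VisionConfig, or None if absent."""
    metadata = json.loads(metadata_json)
    vision = metadata.get("vision")
    if vision is None:
        return None  # text-only bundle: no vision block present
    return VisionConfig(
        image_size=int(vision["image_size"]),
        patch_size=int(vision["patch_size"]),
        num_image_tokens=int(vision["num_image_tokens"]),
        image_token_id=int(vision["image_token_id"]),
    )


# Illustrative metadata only; the values do not describe any real model.
EXAMPLE_METADATA = """
{
  "kind": "vlm",
  "vision": {
    "image_size": 224,
    "patch_size": 14,
    "num_image_tokens": 256,
    "image_token_id": 32000
  }
}
"""

config = parse_vision_config(EXAMPLE_METADATA)
print(config)
```

In this example the token count matches a 16 x 16 patch grid (224 / 14 = 16 per side), though a real model's projector may produce a different number of tokens per image.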
