[FEATURE] Multimodal LLM Support — Image and document understanding #469

Description

@gelluisaac

Extend LLM capabilities to handle multimodal inputs including images, PDFs,
charts, and structured documents for richer analysis.

Scope

Build a multimodal processing pipeline for document understanding and visual analysis.

Files to Touch/Create

  • astroml/llm/multimodal/__init__.py
  • astroml/llm/multimodal/vision.py — Vision model integration (GPT-4V, Claude)
  • astroml/llm/multimodal/ocr.py — OCR for documents and images
  • astroml/llm/multimodal/charts.py — Chart and graph understanding
  • astroml/llm/multimodal/processors.py — Image preprocessing
  • astroml/llm/multimodal/prompts.py — Multimodal prompt templates
  • api/routers/multimodal.py — Multimodal API endpoints

Supported Inputs

  1. Images:
    • Transaction receipts
    • ID documents (KYC)
    • Screenshots of fraud alerts
  2. Documents:
    • PDFs (invoices, reports)
    • Scanned documents
    • Excel/CSV files
  3. Charts:
    • Model performance charts
    • Transaction volume graphs
    • Financial statements
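
One way the pipeline might route the inputs listed above is by file extension before any model is called. The sketch below is only an illustration: `InputKind`, `classify_input`, and the extension map are hypothetical names, not part of the proposed modules. Charts usually arrive as ordinary image files, so an extension alone cannot tell a chart from a receipt; that distinction would have to come from the caller or a later classification step.

```python
from enum import Enum
from pathlib import Path


class InputKind(Enum):
    """Coarse input categories (hypothetical, for illustration only)."""
    IMAGE = "image"
    DOCUMENT = "document"
    TABULAR = "tabular"
    UNSUPPORTED = "unsupported"


# Assumed mapping from file extension to category; the issue names
# PNG, JPG, PDF, Excel, and CSV but does not prescribe this table.
_EXTENSION_KINDS = {
    ".png": InputKind.IMAGE,
    ".jpg": InputKind.IMAGE,
    ".jpeg": InputKind.IMAGE,
    ".pdf": InputKind.DOCUMENT,
    ".xlsx": InputKind.TABULAR,
    ".csv": InputKind.TABULAR,
}


def classify_input(filename: str) -> InputKind:
    """Return the category for a file name, ignoring extension case."""
    suffix = Path(filename).suffix.lower()
    return _EXTENSION_KINDS.get(suffix, InputKind.UNSUPPORTED)


if __name__ == "__main__":
    for name in ("receipt.PNG", "invoice.pdf", "volumes.csv", "notes.txt"):
        print(name, "->", classify_input(name).value)
```

Extension checks are cheap but easy to spoof, so a real endpoint would probably also inspect the file contents before trusting the result.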

Implementation Details

  • GPT-4V or Claude 3 for vision tasks
  • Tesseract or enterprise OCR for text extraction
  • Image resizing and format conversion
  • Prompt engineering for vision tasks
  • Caching of extracted text/descriptions
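
The caching item above could be approached by keying results on a hash of the raw file bytes, so that the same image or PDF is never sent through OCR or a vision model twice. This is a minimal in-memory sketch under that assumption: `ExtractionCache` and `fake_ocr` are hypothetical names, and a real deployment would likely use a shared store rather than a process-local dict.

```python
import hashlib
from typing import Callable, Dict


class ExtractionCache:
    """In-memory cache of extracted text, keyed by content hash (illustrative)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def key_for(data: bytes) -> str:
        # Identical bytes always produce the same key, regardless of file name.
        return hashlib.sha256(data).hexdigest()

    def get_or_compute(self, data: bytes, extract: Callable[[bytes], str]) -> str:
        """Return cached text for `data`, calling `extract` only on a miss."""
        key = self.key_for(data)
        if key not in self._store:
            self._store[key] = extract(data)
        return self._store[key]


if __name__ == "__main__":
    calls = 0

    def fake_ocr(data: bytes) -> str:
        # Stand-in for a real OCR or vision call (assumption, not a real API).
        global calls
        calls += 1
        return data.decode("utf-8").upper()

    cache = ExtractionCache()
    cache.get_or_compute(b"total: 42.00", fake_ocr)
    cache.get_or_compute(b"total: 42.00", fake_ocr)
    print("extractor calls:", calls)
```

Hashing the content rather than the file name means a re-uploaded or renamed file still hits the cache.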

Acceptance Criteria

  • Image classification accuracy >90%
  • OCR text extraction accuracy >95%
  • Chart data extraction matches ground truth
  • Processing time <3s per image
  • Supports common formats (PNG, JPG, PDF)
  • Multimodal prompts fall back to text-only prompts when image input is unavailable
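
The criteria above do not say how OCR accuracy is measured. One common choice, assumed here purely for illustration, is character-level accuracy: one minus the character error rate, where the error rate is the Levenshtein edit distance between the OCR output and a ground-truth transcript, divided by the transcript length. The function names `levenshtein` and `char_accuracy` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """Character accuracy in [0, 1], defined as 1 - CER (an assumed metric)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


if __name__ == "__main__":
    truth = "TOTAL 42.00"
    ocr_output = "TOTAL 42.0O"  # letter O misread in place of a zero
    print(f"accuracy: {char_accuracy(truth, ocr_output):.3f}")
```

Under this definition the ">95%" threshold would mean at most one character error per twenty reference characters; a word-level metric would give different numbers for the same output.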

Use Cases

  • Automated KYC document verification
  • Receipt scanning for loyalty points
  • Fraud evidence analysis (screenshots)
  • Chart interpretation for reports

Labels

enhancement, llm, multimodal, vision
