[FEATURE] LLM-based Data Labeling — Automated annotation for ML training data #475

@gelluisaac

Description

Use LLMs to automatically label and annotate data for ML model training,
reducing manual labeling effort while maintaining quality.

Scope

Build an automated data labeling pipeline with LLM validation.

Files to Touch/Create

  • astroml/llm/labeling/__init__.py
  • astroml/llm/labeling/labeler.py — Core labeling logic
  • astroml/llm/labeling/schemas.py — Label schema definitions
  • astroml/llm/labeling/validators.py — Label validation
  • astroml/llm/labeling/consensus.py — Multi-LLM consensus
  • astroml/llm/labeling/human.py — Human-in-the-loop integration
  • astroml/tasks/labeling.py — Batch labeling worker

Labeling Tasks

  1. Transaction Classification: fraud/suspicious/legitimate
  2. Alert Categorization: pattern type, severity
  3. Entity Resolution: match accounts across sources
  4. Sentiment Analysis: user feedback categorization
  5. Named Entity Recognition: extract entities from text

Implementation Details

  • Prompt engineering for consistent labeling
  • Confidence scoring per label
  • Multi-LLM voting for uncertain cases
  • Human review queue for low-confidence labels
  • Active learning: prioritize informative samples
  • Label versioning and audit trail
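
One possible way to combine confidence scoring, multi-LLM voting, and the human review queue is sketched below. It is an assumption-laden illustration, not a specification: `Vote`, `Decision`, and `resolve` are hypothetical names, the 0.8 threshold is a placeholder, and multiplying agreement by mean confidence is just one simple scoring choice.

```python
from collections import Counter
from statistics import mean
from typing import NamedTuple, Optional, Sequence


class Vote(NamedTuple):
    model: str         # identifier of the LLM that voted
    label: str
    confidence: float  # self-reported confidence, assumed in [0.0, 1.0]


class Decision(NamedTuple):
    label: Optional[str]  # None when the top labels are tied
    agreement: float      # share of votes that went to the winning label
    needs_review: bool    # True means route the item to the human queue


def resolve(votes: Sequence[Vote], threshold: float = 0.8) -> Decision:
    """Majority vote across models, with low-scoring or tied items sent to review."""
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(v.label for v in votes).most_common()
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    agreement = top_count / len(votes)
    # Mean confidence of the models that backed the winning label.
    support = mean(v.confidence for v in votes if v.label == top_label)
    needs_review = tied or agreement * support < threshold
    return Decision(None if tied else top_label, agreement, needs_review)
```

With three models, a 2-to-1 split at 0.9 confidence scores about 0.6 and goes to review, while a unanimous vote at the same confidence passes. Persisting each `Vote` alongside the `Decision` would be one way to feed the audit trail.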

Acceptance Criteria

  • Label accuracy >85% without human review
  • Labeling throughput >1000 items/hour
  • Cost <$0.01 per item
  • Low-confidence items routed to human review
  • Consensus improves accuracy to >95%
  • Complete audit trail of all labels
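
The cost and throughput targets can be sanity-checked with simple arithmetic before any model is chosen. The sketch below assumes per-token pricing quoted per 1,000 tokens and a fixed per-request latency; all function names are hypothetical, and real prices and latencies must come from the chosen provider.

```python
def cost_per_item(
    input_tokens: int,
    output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
    n_models: int = 1,
) -> float:
    """Estimated USD cost to label one item; n_models > 1 models consensus voting."""
    per_call = (
        (input_tokens / 1000) * usd_per_1k_input
        + (output_tokens / 1000) * usd_per_1k_output
    )
    return per_call * n_models


def items_per_hour(latency_seconds: float, concurrency: int) -> float:
    """Steady-state throughput, assuming each request takes latency_seconds."""
    return 3600 * concurrency / latency_seconds


def meets_targets(
    cost: float,
    throughput: float,
    max_cost: float = 0.01,         # "Cost <$0.01 per item"
    min_throughput: float = 1000,   # "Labeling throughput >1000 items/hour"
) -> bool:
    return cost < max_cost and throughput > min_throughput
```

Note that consensus multiplies the per-item cost by the number of models, so the 95% accuracy target and the $0.01 cost target constrain each other; voting only on uncertain items, as described under Implementation Details, keeps the average cost lower.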

Quality Assurance

  • Inter-LLM agreement metrics
  • Random sample human review
  • Drift detection for label distributions
  • Feedback loop from model performance
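
Two of these checks have simple standard forms, sketched here in plain Python. Cohen's kappa is used as the agreement metric between two LLMs, and total variation distance as the drift measure between a baseline and a current label distribution; both are choices made for illustration, since the issue does not name specific metrics, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def distribution(labels: Sequence[str]) -> Dict[str, float]:
    """Relative frequency of each label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two labelers on the same items."""
    if len(a) != len(b):
        raise ValueError("both labelers must label the same items")
    pa, pb = distribution(a), distribution(b)
    observed = sum(x == y for x, y in zip(a, b)) / len(a)
    expected = sum(pa[label] * pb.get(label, 0.0) for label in pa)
    if expected == 1.0:  # both labelers always emit the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def label_drift(baseline: Sequence[str], current: Sequence[str]) -> float:
    """Total variation distance between two label distributions, in [0, 1]."""
    p, q = distribution(baseline), distribution(current)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
```

A kappa near 0 means two models agree no more often than chance, which suggests the prompt or schema is ambiguous. A drift value that rises above an agreed threshold would be a signal to trigger the random-sample human review.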

Labels

enhancement, llm, data-labeling, ml
