Agentic Correction System - Implementation Status

Last Updated: 2025-10-27 Status: Phase 1 and Phase 2 (Backend) Complete

Overview

This document tracks the implementation of the classification-first agentic correction system with human feedback loop as specified in the plan.

✅ Completed

Phase 1: Classification-First Correction Workflow

1.1 Gap Classification Schema ✅

File: lyrics_transcriber/correction/agentic/models/schemas.py

Added GapCategory enum with 8 categories:
- PUNCTUATION_ONLY: Style differences only
- SOUND_ALIKE: Homophones and similar-sounding errors
- BACKGROUND_VOCALS: Transcribed backing vocals in parentheses
- EXTRA_WORDS: Filler words like "And", "But"
- REPEATED_SECTION: Chorus repetitions
- COMPLEX_MULTI_ERROR: Large gaps with multiple error types
- AMBIGUOUS: Unclear without audio
- NO_ERROR: Matches at least one reference source
Added GapClassification model with fields:
- gap_id: Unique identifier
- category: Gap category
- confidence: 0-1 score
- reasoning: Explanation
- suggested_handler: Handler recommendation
Updated CorrectionProposal with:
- gap_category: Classification category
- requires_human_review: Flag for manual review
- artist, title: Song metadata
- Added "NoAction" and "Flag" to action types

1.2 Classification Prompt ✅

File: lyrics_transcriber/correction/agentic/prompts/classifier.py

Created build_classification_prompt() function that:
- Includes gap text, context, and reference lyrics from all sources
- Includes artist/title for proper noun context
- Provides few-shot examples from gaps_review.yaml
- Requests structured JSON output matching GapClassification schema
Implemented load_few_shot_examples() with:
- Dynamic loading from examples.yaml (if exists)
- Hardcoded fallback examples covering all categories
- Examples extracted directly from your manual gap annotations

1.3 Category-Specific Handlers ✅

Files: lyrics_transcriber/correction/agentic/handlers/

Implemented 8 handler classes:

PunctuationHandler: Returns NO_ACTION for style differences
NoErrorHandler: Returns NO_ACTION when reference matches
BackgroundVocalsHandler: Proposes DELETE for parenthesized content
ExtraWordsHandler: Detects and removes filler words
SoundAlikeHandler: Extracts replacement from reference context
RepeatedSectionHandler: Flags for human review
ComplexMultiErrorHandler: Flags complex gaps
AmbiguousHandler: Flags unclear cases

Created HandlerRegistry for mapping categories to handlers
All handlers extend BaseHandler abstract class
Handlers return CorrectionProposal objects with metadata

1.4 AgenticCorrector Workflow ✅

File: lyrics_transcriber/correction/agentic/agent.py

Added classify_gap() method:
- Builds classification prompt
- Calls AI provider for classification
- Returns GapClassification or None
Added propose_for_gap() method implementing two-step workflow:
1. Classify the gap using LLM
2. Route to appropriate handler based on category
3. Handler generates correction proposals
4. Add metadata (artist, title, category) to proposals
5. Handle errors gracefully with fallback to FLAG
Kept legacy propose() method marked as deprecated

1.5 LyricsCorrector Integration ✅

File: lyrics_transcriber/correction/corrector.py

Updated agentic correction section to:
- Prepare structured gap data (words with IDs, times)
- Extract context (10 preceding/following words)
- Build reference contexts from all sources
- Pass artist and title from metadata
- Call new propose_for_gap() method instead of old prompt-based approach

Phase 2: Human Feedback Collection System (Backend)

2.1 Correction Annotation Schema ✅

File: lyrics_transcriber/correction/feedback/schemas.py

Created CorrectionAnnotationType enum (9 types including MANUAL_EDIT)
Created CorrectionAction enum (7 actions: NO_ACTION, REPLACE, DELETE, INSERT, MERGE, SPLIT, FLAG)
Created CorrectionAnnotation model with:
- Unique annotation_id (UUID)
- Song identification (audio_hash, artist, title)
- Classification (annotation_type, action_taken)
- Content (original_text, corrected_text)
- Human metadata (confidence 1-5, reasoning min 10 chars)
- Agentic comparison (agentic_proposal, agentic_category, agentic_agreed)
- Reference tracking (reference_sources_consulted)
- Session tracking (session_id, timestamp)
Created AnnotationStatistics for aggregated metrics

2.2 Feedback Storage Backend ✅

File: lyrics_transcriber/correction/feedback/store.py

Implemented FeedbackStore class with JSONL storage:
- File: {cache_dir}/correction_annotations.jsonl
- One annotation per line for easy appending
- Automatic datetime serialization/deserialization
Methods implemented:
- save_annotation(): Save single annotation
- save_annotations(): Batch save
- get_all_annotations(): Load all with error recovery
- get_annotations_by_song(): Filter by audio hash
- get_annotations_by_category(): Filter by type
- get_statistics(): Aggregate metrics (counts, averages, patterns)
- export_to_training_data(): Export high-confidence annotations for fine-tuning

2.3 Backend API Endpoints ✅

File: lyrics_transcriber/review/server.py

Initialized NewFeedbackStore in ReviewServer.__init__()
Added 3 new API endpoints:
- POST /api/v1/annotations: Save annotation with validation
- GET /api/v1/annotations/{audio_hash}: Get annotations for song
- GET /api/v1/annotations/stats: Get aggregated statistics
All endpoints include proper error handling and HTTP status codes

🚧 In Progress / Not Started

Phase 2: Human Feedback Collection (Frontend)

2.4 UI Annotation Modal Component ⏳

File: lyrics_transcriber/frontend/src/components/CorrectionAnnotationModal.tsx (to create)

Required: React modal component with:

Annotation type dropdown (9 categories)
Confidence slider (1-5 scale)
Reasoning textarea (required, min 10 chars)
Display of agentic AI suggestion (if applicable)
Display of reference lyrics context
"Save & Continue" and "Skip" buttons
Local state management until final submission

2.5 Edit Workflow Integration ⏳

Files:

lyrics_transcriber/frontend/src/components/EditModal.tsx
lyrics_transcriber/frontend/src/components/EditWordList.tsx

Required: Wrap edit actions to trigger annotation modal:

Show modal after user confirms word edit/delete/merge/split
Collect annotation data in React state
Submit all annotations on "Finish Review"
Add settings toggle for "Enable correction annotations"

2.6 Frontend Types and API Client ⏳

Files:

lyrics_transcriber/frontend/src/types.ts
lyrics_transcriber/frontend/src/api.ts

Required:

Add CorrectionAnnotation TypeScript interface
Add submitAnnotations() method to API client
Add getAnnotationStats() method

Phase 3: Continuous Improvement Infrastructure

3.1 Analysis Scripts ⏳

File: scripts/analyze_annotations.py (to create)

Required: Python script that:

Loads all annotations from JSONL
Generates Markdown report with:
- Most common error categories
- Agentic AI accuracy by category
- Frequently mis-heard words/phrases
- Cases where reference lyrics were wrong
Outputs to CORRECTION_ANALYSIS.md

3.2 Few-Shot Example Generator ⏳

File: scripts/generate_few_shot_examples.py (to create)

Required: Python script that:

Selects high-confidence annotations (confidence >= 4)
Formats as YAML prompt examples
Outputs to lyrics_transcriber/correction/agentic/prompts/examples.yaml
Can be run periodically to update classifier

3.3 Classifier Dynamic Examples ⏳

File: lyrics_transcriber/correction/agentic/prompts/classifier.py (update)

Required:

Already has load_few_shot_examples() infrastructure
Will automatically load from examples.yaml if file exists
No changes needed - just need to generate the YAML file

3.4 Feedback Loop Documentation ⏳

File: HUMAN_FEEDBACK_LOOP.md (to create)

Required: Document:

How to use annotation collection in UI
How to run analysis scripts
How to regenerate few-shot examples
How to evaluate improvement over time
Path to fine-tuning custom model with RLHF

Phase 4: Testing and Validation

4.1 Unit Tests ⏳

File: tests/unit/correction/test_classifier.py (to create)

Required: Test cases for:

Gap classifier with examples from gaps_review.yaml
Each category handler
Edge cases (ambiguous, no reference match)

4.2 Integration Tests ⏳

File: tests/integration/test_agentic_workflow.py (update)

Required: End-to-end tests:

Classification → correction flow
Use Time Bomb song as fixture
Verify correct handlers are invoked
Verify FLAG actions for ambiguous cases

4.3 Feedback System Tests ⏳

File: tests/unit/correction/test_feedback_store.py (to create)

Required: Test cases for:

Save and retrieve annotations
JSONL format correctness
Statistics generation
Training data export

Testing the Implementation

Backend Testing (Can be done now)

Test Classification Workflow:

USE_AGENTIC_AI=1 python -m lyrics_transcriber.cli.cli_main Time-Bomb.flac \
  --artist "Rancid" --title "Time Bomb"

Expected: Gaps will be classified into categories, handlers will propose corrections

Test Annotation Storage:

from lyrics_transcriber.correction.feedback.store import FeedbackStore
from lyrics_transcriber.correction.feedback.schemas import CorrectionAnnotation, CorrectionAnnotationType, CorrectionAction

store = FeedbackStore("cache")
annotation = CorrectionAnnotation(
    audio_hash="test123",
    annotation_type=CorrectionAnnotationType.SOUND_ALIKE,
    action_taken=CorrectionAction.REPLACE,
    original_text="out",
    corrected_text="now",
    confidence=5.0,
    reasoning="Reference lyrics confirm it should be 'now'",
    artist="Rancid",
    title="Time Bomb",
    session_id="test_session"
)
store.save_annotation(annotation)
stats = store.get_statistics()
print(stats)

Test API Endpoints:

# Start review server and test endpoints
curl -X POST http://localhost:8000/api/v1/annotations \
  -H "Content-Type: application/json" \
  -d '{"audio_hash": "test", "annotation_type": "sound_alike", ...}'

curl http://localhost:8000/api/v1/annotations/stats

Frontend Testing (Once UI is implemented)

Open review UI
Make a correction (edit/delete/merge word)
Annotation modal should appear
Fill in annotation details
Click "Save & Continue"
Make more corrections
Click "Finish Review"
Check cache/correction_annotations.jsonl for saved data

Architecture Decisions

Classification-First Approach

Why: Breaks complex problem into simpler steps
Benefit: Each handler can focus on one error type
Trade-off: Two LLM calls per gap (classification + handler logic), but handlers are deterministic

JSONL Storage

Why: Simple, append-only, no database required
Benefit: Easy to parse, version control friendly, portable
Trade-off: Not suitable for millions of annotations (but we expect hundreds/thousands)

Fail-Fast on Errors

Why: Better to flag for human review than make wrong correction
Benefit: Maintains transcription quality
Trade-off: More human review needed initially (improves over time with feedback)

Handler Registry Pattern

Why: Easy to add new category handlers
Benefit: Extensible, testable, follows Open/Closed Principle
Trade-off: Slightly more complex than if/else routing

Next Steps

Priority 1 (Required for feedback loop):

Implement CorrectionAnnotationModal.tsx
Integrate modal into edit workflow
Update frontend types and API client
Test end-to-end annotation collection

Priority 2 (Analysis and improvement): 5. Create analyze_annotations.py script 6. Create generate_few_shot_examples.py script 7. Generate initial examples.yaml from collected data 8. Document feedback loop in HUMAN_FEEDBACK_LOOP.md

Priority 3 (Validation): 9. Write unit tests for classifiers and handlers 10. Write integration tests for full workflow 11. Write tests for feedback store

Files Created

Python Backend

lyrics_transcriber/correction/agentic/models/schemas.py (updated)
lyrics_transcriber/correction/agentic/prompts/__init__.py (new)
lyrics_transcriber/correction/agentic/prompts/classifier.py (new)
lyrics_transcriber/correction/agentic/handlers/__init__.py (new)
lyrics_transcriber/correction/agentic/handlers/base.py (new)
lyrics_transcriber/correction/agentic/handlers/punctuation.py (new)
lyrics_transcriber/correction/agentic/handlers/no_error.py (new)
lyrics_transcriber/correction/agentic/handlers/background_vocals.py (new)
lyrics_transcriber/correction/agentic/handlers/extra_words.py (new)
lyrics_transcriber/correction/agentic/handlers/sound_alike.py (new)
lyrics_transcriber/correction/agentic/handlers/repeated_section.py (new)
lyrics_transcriber/correction/agentic/handlers/complex_multi_error.py (new)
lyrics_transcriber/correction/agentic/handlers/ambiguous.py (new)
lyrics_transcriber/correction/agentic/handlers/registry.py (new)
lyrics_transcriber/correction/agentic/agent.py (updated)
lyrics_transcriber/correction/corrector.py (updated)
lyrics_transcriber/correction/feedback/__init__.py (new)
lyrics_transcriber/correction/feedback/schemas.py (new)
lyrics_transcriber/correction/feedback/store.py (new)
lyrics_transcriber/review/server.py (updated)

Documentation

AGENTIC_IMPLEMENTATION_STATUS.md (this file)

Known Issues / Limitations

Classification accuracy depends on LLM quality
- Solution: Use better models (GPT-4, Claude Sonnet) for classification
- Solution: Collect human feedback to improve prompts
SoundAlikeHandler may fail to extract replacement
- Fallback: Flags for human review
- Future: Use fuzzy matching and phonetic algorithms
No A/B testing framework yet
- Can't compare different prompt versions easily
- Future enhancement: Track model performance over time
Frontend not implemented
- Can't collect human feedback yet
- Priority for next implementation phase

Fixes Applied

Enum Value Case Mismatch (2025-10-27)

Issue: LLM was returning category values in uppercase (e.g., "SOUND_ALIKE") but Pydantic enum expected lowercase with underscores (e.g., "sound_alike"), causing validation errors.

Fix: Updated all enum values in GapCategory and CorrectionAnnotationType to use uppercase format that LLMs naturally return. Also updated the prompt to explicitly show the expected format.

Files Changed:

lyrics_transcriber/correction/agentic/models/schemas.py
lyrics_transcriber/correction/feedback/schemas.py
lyrics_transcriber/correction/agentic/prompts/classifier.py

JSON Parsing with Invalid Escapes (2025-10-27)

Issue: LLM responses contained invalid JSON escape sequences like \' (e.g., in "out, I\'m"), which caused JSON parsing to fail. Python's json.loads() only allows specific escape sequences (\", \\, \/, \b, \f, \n, \r, \t).

Fix: Enhanced ResponseParser to automatically fix common JSON issues before parsing:

Replace invalid \' with ' (single quotes don't need escaping in JSON)
Remove trailing commas before } or ]
Retry parsing after fixes before falling back to raw response