Last Updated: 2025-10-27
Status: Phase 1 and Phase 2 (Backend) Complete
This document tracks the implementation of the classification-first agentic correction system with human feedback loop as specified in the plan.
File: lyrics_transcriber/correction/agentic/models/schemas.py
- Added `GapCategory` enum with 8 categories:
  - `PUNCTUATION_ONLY`: Style differences only
  - `SOUND_ALIKE`: Homophones and similar-sounding errors
  - `BACKGROUND_VOCALS`: Transcribed backing vocals in parentheses
  - `EXTRA_WORDS`: Filler words like "And", "But"
  - `REPEATED_SECTION`: Chorus repetitions
  - `COMPLEX_MULTI_ERROR`: Large gaps with multiple error types
  - `AMBIGUOUS`: Unclear without audio
  - `NO_ERROR`: Matches at least one reference source
- Added `GapClassification` model with fields:
  - `gap_id`: Unique identifier
  - `category`: Gap category
  - `confidence`: 0-1 score
  - `reasoning`: Explanation
  - `suggested_handler`: Handler recommendation
- Updated `CorrectionProposal` with:
  - `gap_category`: Classification category
  - `requires_human_review`: Flag for manual review
  - `artist`, `title`: Song metadata
- Added "NoAction" and "Flag" to action types
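The shapes above can be sketched roughly as follows. This is a minimal illustration using stdlib `Enum` and `dataclass`; the actual models are Pydantic and include validation:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class GapCategory(str, Enum):
    # Uppercase values, matching the format LLMs naturally return
    PUNCTUATION_ONLY = "PUNCTUATION_ONLY"
    SOUND_ALIKE = "SOUND_ALIKE"
    BACKGROUND_VOCALS = "BACKGROUND_VOCALS"
    EXTRA_WORDS = "EXTRA_WORDS"
    REPEATED_SECTION = "REPEATED_SECTION"
    COMPLEX_MULTI_ERROR = "COMPLEX_MULTI_ERROR"
    AMBIGUOUS = "AMBIGUOUS"
    NO_ERROR = "NO_ERROR"

@dataclass
class GapClassification:
    gap_id: str                         # unique identifier for the gap
    category: GapCategory               # one of the 8 categories above
    confidence: float                   # 0-1 score from the classifier
    reasoning: str                      # the LLM's explanation
    suggested_handler: Optional[str] = None  # handler recommendation
```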
File: lyrics_transcriber/correction/agentic/prompts/classifier.py
- Created `build_classification_prompt()` function that:
  - Includes gap text, context, and reference lyrics from all sources
  - Includes artist/title for proper noun context
  - Provides few-shot examples from `gaps_review.yaml`
  - Requests structured JSON output matching the `GapClassification` schema
- Implemented `load_few_shot_examples()` with:
  - Dynamic loading from `examples.yaml` (if it exists)
  - Hardcoded fallback examples covering all categories
  - Examples extracted directly from your manual gap annotations
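The loading behavior might look like this sketch; the fallback examples shown here are hypothetical stand-ins for the real hardcoded set:

```python
from pathlib import Path

# Hypothetical fallback examples; the real set covers all 8 categories.
FALLBACK_EXAMPLES = [
    {"gap_text": "time bomb", "category": "NO_ERROR",
     "reasoning": "Matches a reference source exactly"},
    {"gap_text": "out", "category": "SOUND_ALIKE",
     "reasoning": "References agree the word should be 'now'"},
]

def load_few_shot_examples(examples_path: str = "examples.yaml") -> list[dict]:
    """Load examples from YAML if the file exists, else use the fallback set."""
    path = Path(examples_path)
    if path.exists():
        import yaml  # PyYAML, only needed when the file is present
        with path.open() as f:
            loaded = yaml.safe_load(f) or []
        if loaded:
            return loaded
    return FALLBACK_EXAMPLES
```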
Files: lyrics_transcriber/correction/agentic/handlers/
Implemented 8 handler classes:
- `PunctuationHandler`: Returns NO_ACTION for style differences
- `NoErrorHandler`: Returns NO_ACTION when a reference matches
- `BackgroundVocalsHandler`: Proposes DELETE for parenthesized content
- `ExtraWordsHandler`: Detects and removes filler words
- `SoundAlikeHandler`: Extracts the replacement from reference context
- `RepeatedSectionHandler`: Flags for human review
- `ComplexMultiErrorHandler`: Flags complex gaps
- `AmbiguousHandler`: Flags unclear cases
- Created `HandlerRegistry` for mapping categories to handlers
- All handlers extend the `BaseHandler` abstract class
- Handlers return `CorrectionProposal` objects with metadata
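A minimal sketch of the registry pattern, with two representative handlers and proposals shown as plain dicts rather than `CorrectionProposal` objects:

```python
from abc import ABC, abstractmethod

class BaseHandler(ABC):
    """Abstract base: each handler turns one classified gap into proposals."""
    @abstractmethod
    def handle(self, gap: dict) -> list[dict]:
        ...

class PunctuationHandler(BaseHandler):
    def handle(self, gap: dict) -> list[dict]:
        # Style-only differences: no correction needed
        return [{"action": "NoAction", "gap_id": gap["gap_id"]}]

class AmbiguousHandler(BaseHandler):
    def handle(self, gap: dict) -> list[dict]:
        # Unclear without audio: defer to a human
        return [{"action": "Flag", "gap_id": gap["gap_id"]}]

class HandlerRegistry:
    """Maps category names to handler instances; unknown categories get flagged."""
    def __init__(self):
        self._handlers = {
            "PUNCTUATION_ONLY": PunctuationHandler(),
            "AMBIGUOUS": AmbiguousHandler(),
            # ...one entry per GapCategory in the real registry
        }

    def get(self, category: str) -> BaseHandler:
        return self._handlers.get(category, AmbiguousHandler())
```

Falling back to the ambiguous handler for unknown categories keeps the conservative "flag rather than guess" policy even when the classifier returns something unexpected.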
File: lyrics_transcriber/correction/agentic/agent.py
- Added `classify_gap()` method:
  - Builds the classification prompt
  - Calls the AI provider for classification
  - Returns a `GapClassification` or None
- Added `propose_for_gap()` method implementing the two-step workflow:
  - Classify the gap using the LLM
  - Route to the appropriate handler based on category
  - The handler generates correction proposals
  - Add metadata (artist, title, category) to the proposals
  - Handle errors gracefully with a fallback to FLAG
- Kept the legacy `propose()` method, marked as deprecated
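The two-step workflow above can be sketched as a standalone function. The `classify_gap` callable and `registry` argument are assumptions mirroring the real agent's collaborators, and the classification result is shown as a plain dict for brevity:

```python
def propose_for_gap(gap, classify_gap, registry, artist, title):
    """Classify a gap, then route it to the matching handler."""
    try:
        classification = classify_gap(gap)
        if classification is None:
            raise ValueError("classification failed")
        handler = registry.get(classification["category"])
        proposals = handler.handle(gap)
        for p in proposals:  # attach song metadata to every proposal
            p.update(artist=artist, title=title,
                     gap_category=classification["category"])
        return proposals
    except Exception:
        # Graceful degradation: flag the gap for human review
        return [{"action": "Flag", "gap_id": gap["gap_id"],
                 "artist": artist, "title": title}]
```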
File: lyrics_transcriber/correction/corrector.py
- Updated the agentic correction section to:
  - Prepare structured gap data (words with IDs, times)
  - Extract context (10 preceding/following words)
  - Build reference contexts from all sources
  - Pass artist and title from metadata
  - Call the new `propose_for_gap()` method instead of the old prompt-based approach
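The context-extraction step is straightforward windowing; a sketch (function name and signature are illustrative, not the actual corrector API):

```python
def extract_context(words: list[str], start: int, end: int, window: int = 10):
    """Return up to `window` words before and after the gap at words[start:end]."""
    preceding = words[max(0, start - window):start]
    following = words[end:end + window]
    return preceding, following
```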
File: lyrics_transcriber/correction/feedback/schemas.py
- Created `CorrectionAnnotationType` enum (9 types including MANUAL_EDIT)
- Created `CorrectionAction` enum (7 actions: NO_ACTION, REPLACE, DELETE, INSERT, MERGE, SPLIT, FLAG)
- Created `CorrectionAnnotation` model with:
  - Unique `annotation_id` (UUID)
  - Song identification (`audio_hash`, `artist`, `title`)
  - Classification (`annotation_type`, `action_taken`)
  - Content (`original_text`, `corrected_text`)
  - Human metadata (`confidence` 1-5, `reasoning` min 10 chars)
  - Agentic comparison (`agentic_proposal`, `agentic_category`, `agentic_agreed`)
  - Reference tracking (`reference_sources_consulted`)
  - Session tracking (`session_id`, `timestamp`)
- Created `AnnotationStatistics` for aggregated metrics
File: lyrics_transcriber/correction/feedback/store.py
- Implemented `FeedbackStore` class with JSONL storage:
  - File: `{cache_dir}/correction_annotations.jsonl`
  - One annotation per line for easy appending
  - Automatic datetime serialization/deserialization
- Methods implemented:
  - `save_annotation()`: Save a single annotation
  - `save_annotations()`: Batch save
  - `get_all_annotations()`: Load all with error recovery
  - `get_annotations_by_song()`: Filter by audio hash
  - `get_annotations_by_category()`: Filter by type
  - `get_statistics()`: Aggregate metrics (counts, averages, patterns)
  - `export_to_training_data()`: Export high-confidence annotations for fine-tuning
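The append-and-recover core of such a store can be sketched in a few lines (annotations shown as plain dicts; the real store serializes Pydantic models):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

class FeedbackStore:
    """Append-only JSONL store, one annotation dict per line."""
    def __init__(self, cache_dir: str):
        self.path = Path(cache_dir) / "correction_annotations.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def save_annotation(self, annotation: dict) -> None:
        record = dict(annotation)
        # Automatic datetime stamping on write
        record.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def get_all_annotations(self) -> list[dict]:
        if not self.path.exists():
            return []
        annotations = []
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                try:
                    annotations.append(json.loads(line))
                except json.JSONDecodeError:
                    continue  # error recovery: skip corrupt lines
        return annotations
```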
File: lyrics_transcriber/review/server.py
- Initialized a new `FeedbackStore` in `ReviewServer.__init__()`
- Added 3 new API endpoints:
  - `POST /api/v1/annotations`: Save an annotation with validation
  - `GET /api/v1/annotations/{audio_hash}`: Get annotations for a song
  - `GET /api/v1/annotations/stats`: Get aggregated statistics
- All endpoints include proper error handling and HTTP status codes
File: lyrics_transcriber/frontend/src/components/CorrectionAnnotationModal.tsx (to create)
Required: React modal component with:
- Annotation type dropdown (9 categories)
- Confidence slider (1-5 scale)
- Reasoning textarea (required, min 10 chars)
- Display of agentic AI suggestion (if applicable)
- Display of reference lyrics context
- "Save & Continue" and "Skip" buttons
- Local state management until final submission
Files:
- lyrics_transcriber/frontend/src/components/EditModal.tsx
- lyrics_transcriber/frontend/src/components/EditWordList.tsx
Required: Wrap edit actions to trigger annotation modal:
- Show modal after user confirms word edit/delete/merge/split
- Collect annotation data in React state
- Submit all annotations on "Finish Review"
- Add settings toggle for "Enable correction annotations"
Files:
- lyrics_transcriber/frontend/src/types.ts
- lyrics_transcriber/frontend/src/api.ts
Required:
- Add a `CorrectionAnnotation` TypeScript interface
- Add a `submitAnnotations()` method to the API client
- Add a `getAnnotationStats()` method
File: scripts/analyze_annotations.py (to create)
Required: Python script that:
- Loads all annotations from the JSONL file
- Generates a Markdown report with:
  - Most common error categories
  - Agentic AI accuracy by category
  - Frequently mis-heard words/phrases
  - Cases where reference lyrics were wrong
- Outputs to `CORRECTION_ANALYSIS.md`
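The core of the report generation could look like this sketch (the function name and report sections are illustrative; only the category-count section is shown):

```python
from collections import Counter

def summarize_annotations(annotations: list[dict]) -> str:
    """Render a small Markdown report of the most common error categories."""
    counts = Counter(a.get("annotation_type", "unknown") for a in annotations)
    lines = ["# Correction Analysis", "", "## Most common error categories", ""]
    for category, n in counts.most_common():
        lines.append(f"- {category}: {n}")
    return "\n".join(lines)
```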
File: scripts/generate_few_shot_examples.py (to create)
Required: Python script that:
- Selects high-confidence annotations (confidence >= 4)
- Formats them as YAML prompt examples
- Outputs to `lyrics_transcriber/correction/agentic/prompts/examples.yaml`
- Can be run periodically to update the classifier
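The selection step can be sketched as a filter (field names follow the `CorrectionAnnotation` schema above; the output keys are hypothetical prompt-example fields, and the YAML dump is omitted):

```python
def select_training_examples(annotations: list[dict], min_confidence: float = 4) -> list[dict]:
    """Keep only high-confidence annotations, reduced to prompt-example fields."""
    return [
        {"original": a["original_text"], "corrected": a["corrected_text"],
         "category": a["annotation_type"], "reasoning": a["reasoning"]}
        for a in annotations
        if a.get("confidence", 0) >= min_confidence
    ]
```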
File: lyrics_transcriber/correction/agentic/prompts/classifier.py (update)
Required:
- Already has the `load_few_shot_examples()` infrastructure
- Will automatically load from `examples.yaml` if the file exists
- No changes needed - just generate the YAML file
File: HUMAN_FEEDBACK_LOOP.md (to create)
Required: Documentation covering:
- How to use annotation collection in UI
- How to run analysis scripts
- How to regenerate few-shot examples
- How to evaluate improvement over time
- Path to fine-tuning custom model with RLHF
File: tests/unit/correction/test_classifier.py (to create)
Required: Test cases for:
- Gap classifier with examples from `gaps_review.yaml`
- Each category handler
- Edge cases (ambiguous, no reference match)
File: tests/integration/test_agentic_workflow.py (update)
Required: End-to-end tests:
- Classification → correction flow
- Use Time Bomb song as fixture
- Verify correct handlers are invoked
- Verify FLAG actions for ambiguous cases
File: tests/unit/correction/test_feedback_store.py (to create)
Required: Test cases for:
- Save and retrieve annotations
- JSONL format correctness
- Statistics generation
- Training data export
- Test Classification Workflow:
  ```bash
  USE_AGENTIC_AI=1 python -m lyrics_transcriber.cli.cli_main Time-Bomb.flac \
    --artist "Rancid" --title "Time Bomb"
  ```

  Expected: Gaps will be classified into categories and handlers will propose corrections
- Test Annotation Storage:
  ```python
  from lyrics_transcriber.correction.feedback.store import FeedbackStore
  from lyrics_transcriber.correction.feedback.schemas import (
      CorrectionAnnotation,
      CorrectionAnnotationType,
      CorrectionAction,
  )

  store = FeedbackStore("cache")
  annotation = CorrectionAnnotation(
      audio_hash="test123",
      annotation_type=CorrectionAnnotationType.SOUND_ALIKE,
      action_taken=CorrectionAction.REPLACE,
      original_text="out",
      corrected_text="now",
      confidence=5.0,
      reasoning="Reference lyrics confirm it should be 'now'",
      artist="Rancid",
      title="Time Bomb",
      session_id="test_session",
  )
  store.save_annotation(annotation)
  stats = store.get_statistics()
  print(stats)
  ```

- Test API Endpoints:

  ```bash
  # Start the review server, then test the endpoints
  curl -X POST http://localhost:8000/api/v1/annotations \
    -H "Content-Type: application/json" \
    -d '{"audio_hash": "test", "annotation_type": "sound_alike", ...}'

  curl http://localhost:8000/api/v1/annotations/stats
  ```

- Open the review UI
- Make a correction (edit/delete/merge word)
- Annotation modal should appear
- Fill in annotation details
- Click "Save & Continue"
- Make more corrections
- Click "Finish Review"
- Check `cache/correction_annotations.jsonl` for the saved data
Decision: Classification-first two-step workflow
- Why: Breaks a complex problem into simpler steps
- Benefit: Each handler can focus on one error type
- Trade-off: Two LLM calls per gap (classification + handler logic), but handlers are deterministic

Decision: JSONL storage for annotations
- Why: Simple, append-only, no database required
- Benefit: Easy to parse, version-control friendly, portable
- Trade-off: Not suitable for millions of annotations (but we expect hundreds to thousands)

Decision: Conservative correction policy
- Why: Better to flag for human review than to make a wrong correction
- Benefit: Maintains transcription quality
- Trade-off: More human review needed initially (improves over time with feedback)

Decision: Handler registry pattern
- Why: Easy to add new category handlers
- Benefit: Extensible, testable, follows the Open/Closed Principle
- Trade-off: Slightly more complex than if/else routing
Priority 1 (Required for feedback loop):
1. Implement `CorrectionAnnotationModal.tsx`
2. Integrate the modal into the edit workflow
3. Update frontend types and the API client
4. Test end-to-end annotation collection
Priority 2 (Analysis and improvement):
5. Create analyze_annotations.py script
6. Create generate_few_shot_examples.py script
7. Generate initial examples.yaml from collected data
8. Document feedback loop in HUMAN_FEEDBACK_LOOP.md
Priority 3 (Validation):
9. Write unit tests for classifiers and handlers
10. Write integration tests for the full workflow
11. Write tests for the feedback store
- lyrics_transcriber/correction/agentic/models/schemas.py (updated)
- lyrics_transcriber/correction/agentic/prompts/__init__.py (new)
- lyrics_transcriber/correction/agentic/prompts/classifier.py (new)
- lyrics_transcriber/correction/agentic/handlers/__init__.py (new)
- lyrics_transcriber/correction/agentic/handlers/base.py (new)
- lyrics_transcriber/correction/agentic/handlers/punctuation.py (new)
- lyrics_transcriber/correction/agentic/handlers/no_error.py (new)
- lyrics_transcriber/correction/agentic/handlers/background_vocals.py (new)
- lyrics_transcriber/correction/agentic/handlers/extra_words.py (new)
- lyrics_transcriber/correction/agentic/handlers/sound_alike.py (new)
- lyrics_transcriber/correction/agentic/handlers/repeated_section.py (new)
- lyrics_transcriber/correction/agentic/handlers/complex_multi_error.py (new)
- lyrics_transcriber/correction/agentic/handlers/ambiguous.py (new)
- lyrics_transcriber/correction/agentic/handlers/registry.py (new)
- lyrics_transcriber/correction/agentic/agent.py (updated)
- lyrics_transcriber/correction/corrector.py (updated)
- lyrics_transcriber/correction/feedback/__init__.py (new)
- lyrics_transcriber/correction/feedback/schemas.py (new)
- lyrics_transcriber/correction/feedback/store.py (new)
- lyrics_transcriber/review/server.py (updated)
- AGENTIC_IMPLEMENTATION_STATUS.md (this file)
- Classification accuracy depends on LLM quality
  - Solution: Use better models (GPT-4, Claude Sonnet) for classification
  - Solution: Collect human feedback to improve prompts
- `SoundAlikeHandler` may fail to extract a replacement
  - Fallback: Flags for human review
  - Future: Use fuzzy matching and phonetic algorithms
- No A/B testing framework yet
  - Can't compare different prompt versions easily
  - Future enhancement: Track model performance over time
- Frontend not implemented
  - Can't collect human feedback yet
  - Priority for the next implementation phase
Issue: LLM was returning category values in uppercase (e.g., "SOUND_ALIKE") but Pydantic enum expected lowercase with underscores (e.g., "sound_alike"), causing validation errors.
Fix: Updated all enum values in GapCategory and CorrectionAnnotationType to use uppercase format that LLMs naturally return. Also updated the prompt to explicitly show the expected format.
Files Changed:
- lyrics_transcriber/correction/agentic/models/schemas.py
- lyrics_transcriber/correction/feedback/schemas.py
- lyrics_transcriber/correction/agentic/prompts/classifier.py
Issue: LLM responses contained invalid JSON escape sequences like `\'` (e.g., in `"out, I\'m"`), which caused JSON parsing to fail. Python's `json.loads()` only allows the escape sequences JSON defines (`\"`, `\\`, `\/`, `\b`, `\f`, `\n`, `\r`, `\t`, and `\uXXXX`).
Fix: Enhanced the ResponseParser to automatically fix common JSON issues before parsing:
- Replace invalid `\'` with `'` (single quotes don't need escaping in JSON)
- Remove trailing commas before `}` or `]`
- Retry parsing after the fixes before falling back to the raw response
Files Changed:
- lyrics_transcriber/correction/agentic/providers/response_parser.py
Result: System now handles LLM responses with imperfect JSON formatting gracefully.
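The repair strategy can be sketched as follows (a simplified stand-in for the real ResponseParser; note the trailing-comma regex could in principle touch commas inside string values, which the sketch ignores):

```python
import json
import re

def parse_llm_json(raw: str):
    """Try strict parsing first, then repair common LLM formatting issues."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    fixed = raw.replace("\\'", "'")               # \' is not a valid JSON escape
    fixed = re.sub(r",\s*([}\]])", r"\1", fixed)  # drop trailing commas
    return json.loads(fixed)                      # may still raise; caller falls back
```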
- Two LLM calls per gap: classification + handler logic (if the handler needs an LLM)
  - Most handlers are deterministic (no LLM call)
  - Only `SoundAlikeHandler` might need an LLM for complex cases
- JSONL file grows linearly:
  - ~1KB per annotation
  - 1000 annotations = ~1MB
  - Should be fine for years of use
- Frontend modal after each edit:
  - Could be annoying for many edits
  - Consider: batch annotation at the end instead
  - Add a "Skip" button and settings toggle
Once implemented, track:
- Annotation Collection Rate: % of corrections that get annotated
- Agentic Agreement Rate: % where human agrees with AI proposal
- Classification Accuracy: % correctly classified by category (human-verified)
- Correction Quality Over Time: Track accuracy improvements as more feedback is collected
- Human Review Rate: % of gaps flagged vs auto-corrected (should decrease over time)
- Original plan: `.cursor/plans/agentic-correction-system-*.plan.md`
- Manual gap annotations: `gaps_review.yaml`
- Test song: `Time-Bomb.flac` by Rancid