Parser Improvement: Missing OCR correction for dropped characters like 'p na' -> 'pidana'
Severity: MEDIUM
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 1
Current Behavior
Common Indonesian legal terms with characters dropped by OCR (e.g., 'p na' for 'pidana', or truncated fragments such as 'ket') are not corrected. The word 'pidana' is extremely common in legal text, so its OCR corruption 'p na' (missing 'ida') should be caught.
Proposed Fix
Add patterns for common legal terms where OCR drops middle characters or splits the word with a stray space. 'pidana' is the highest-priority term since it appears hundreds of times in criminal law documents. Use word-boundary matching (\b) on both ends of each pattern to avoid false positives inside longer words.
Code Before
# Common word-level OCR errors in Indonesian legal text
(re.compile(r'\bFRESIDEN\b', re.IGNORECASE), 'PRESIDEN'), # P->F OCR confusion
Code After
# Common word-level OCR errors in Indonesian legal text
# Dropped characters in common legal terms
(re.compile(r'\bp\s+na\b', re.IGNORECASE), 'pidana'), # p na -> pidana (dropped 'ida')
(re.compile(r'\bpida\s+na\b', re.IGNORECASE), 'pidana'), # pida na -> pidana (stray space inside the word)
(re.compile(r'\bFRESIDEN\b', re.IGNORECASE), 'PRESIDEN'), # P->F OCR confusion
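To illustrate how the added entries behave, here is a minimal, self-contained sketch. It assumes the patterns are held in a list of (compiled regex, replacement) tuples and applied in order; the names OCR_WORD_PATTERNS and apply_word_patterns are hypothetical and are not taken from ocr_correct.py, whose actual structure may differ.

```python
import re

# Hypothetical excerpt of the pattern table. The real list in
# ocr_correct.py may be named and organized differently.
OCR_WORD_PATTERNS = [
    # Dropped characters in common legal terms
    (re.compile(r'\bp\s+na\b', re.IGNORECASE), 'pidana'),     # p na -> pidana
    (re.compile(r'\bpida\s+na\b', re.IGNORECASE), 'pidana'),  # pida na -> pidana
    (re.compile(r'\bFRESIDEN\b', re.IGNORECASE), 'PRESIDEN'), # P->F OCR confusion
]


def apply_word_patterns(text: str) -> str:
    """Apply each (pattern, replacement) pair in order and return the result."""
    for pattern, replacement in OCR_WORD_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    # Corrupted forms are repaired.
    print(apply_word_patterns("ancaman p na penjara"))
    print(apply_word_patterns("tindak pida na korupsi"))
    # The word boundaries leave these untouched: 'p' is not at the start
    # of a word in "grup na", and 'na' is not a whole word in "p nama".
    print(apply_word_patterns("grup na"))
    print(apply_word_patterns("p nama"))
```

One design point worth checking before merging: because the patterns use re.IGNORECASE with a fixed lowercase replacement, an all-caps corruption such as 'P NA' would also be rewritten as lowercase 'pidana'. If headings in capitals matter, a case-preserving replacement may be needed.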
Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 5 parser feedback entries.