Parser Improvement: UNDANG-UNDANG OCR pattern forces uppercase, preventing proper casing
Severity: MEDIUM
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 1
Current Behavior
The pattern r'\bUNDANG[\s-]*UNDANG\b' with replacement 'UNDANG-UNDANG' matches all case variants (due to re.IGNORECASE) and forces them to ALL CAPS. This converts legitimate mixed-case occurrences like 'Undang-Undang' in body text to 'UNDANG-UNDANG', which is incorrect. In Indonesian legal documents, 'UNDANG-UNDANG' is only used in titles/headers, while body text uses 'Undang-Undang'.
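The behavior described above can be reproduced in isolation. This is a minimal sketch that copies only the pattern and replacement quoted in this report; the sample sentence and the variable names are illustrative assumptions, not taken from ocr_correct.py.

```python
import re

# Pattern and replacement as quoted in this report
pattern = re.compile(r'\bUNDANG[\s-]*UNDANG\b', re.IGNORECASE)
replacement = 'UNDANG-UNDANG'

# Illustrative body-text sentence with correct mixed-case spelling
body = "sebagaimana dimaksud dalam Undang-Undang ini"

# IGNORECASE lets the pattern match 'Undang-Undang', and the literal
# replacement then overwrites it with the ALL CAPS form.
result = pattern.sub(replacement, body)
print(result)  # sebagaimana dimaksud dalam UNDANG-UNDANG ini
```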
Proposed Fix
Change the UNDANG-UNDANG pattern to only fix spacing/hyphenation issues without forcing uppercase. Use a case-preserving approach: only normalize the hyphen and spacing, not the case. Remove re.IGNORECASE from this specific pattern or use a smarter replacement.
Code Before
(re.compile(r'\bUNDANG[\s-]*UNDANG\b', re.IGNORECASE), 'UNDANG-UNDANG'),
Code After
# Fix broken hyphen/spacing in UNDANG-UNDANG without changing case.
# [\s-]+ (rather than \s+) also catches variants such as 'UNDANG - UNDANG'.
# ALL CAPS version (titles/headers)
(re.compile(r'\bUNDANG[\s-]+UNDANG\b'), 'UNDANG-UNDANG'),
# Mixed-case and lowercase versions (body text)
(re.compile(r'\bUndang[\s-]+Undang\b'), 'Undang-Undang'),
(re.compile(r'\bundang[\s-]+undang\b'), 'undang-undang'),
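The "smarter replacement" mentioned in the proposed fix could also be written as one case-insensitive pattern with a callback that reuses the matched words, so only the separator is normalized. This is a hedged sketch, not the code in ocr_correct.py: the names `_UU_PATTERN` and `fix_undang_undang` are hypothetical, and it assumes the replacement list could accept a callable.

```python
import re

# Hypothetical single-pattern alternative (names are illustrative).
# Each word is captured so its original casing can be reused.
_UU_PATTERN = re.compile(r'\b(undang)[\s-]+(undang)\b', re.IGNORECASE)


def fix_undang_undang(text: str) -> str:
    """Normalize the separator to a single hyphen, keeping each word's case."""
    return _UU_PATTERN.sub(lambda m: f"{m.group(1)}-{m.group(2)}", text)


print(fix_undang_undang("UNDANG  UNDANG NOMOR 1"))      # UNDANG-UNDANG NOMOR 1
print(fix_undang_undang("dalam Undang - Undang ini"))   # dalam Undang-Undang ini
print(fix_undang_undang("peraturan perundang-undangan"))  # unchanged
```

Unlike the original `[\s-]*`, this sketch requires at least one separator character, so a fused 'UNDANGUNDANG' is left alone. Inconsistent casing such as 'Undang UNDANG' is also kept as found rather than corrected.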
Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 4 parser feedback entries.