[Parser] UNDANG-UNDANG OCR pattern forces uppercase, preventing proper casing #13

@ilhamfp

Parser Improvement: UNDANG-UNDANG OCR pattern forces uppercase, preventing proper casing

Severity: MEDIUM
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 1

Current Behavior

The pattern r'\bUNDANG[\s-]*UNDANG\b' with replacement 'UNDANG-UNDANG' matches all case variants (due to re.IGNORECASE) and forces them to ALL CAPS. This converts legitimate mixed-case occurrences like 'Undang-Undang' in body text to 'UNDANG-UNDANG', which is incorrect. In Indonesian legal documents, 'UNDANG-UNDANG' is only used in titles/headers, while body text uses 'Undang-Undang'.
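The behavior can be reproduced in isolation. The snippet below is a minimal sketch: the pattern and replacement are copied from this issue, while `CURRENT_RULE`, `apply_rule`, and the sample sentence are hypothetical names and data used only for illustration, not code from ocr_correct.py.

```python
import re

# Rule as quoted in this issue: case-insensitive match, ALL CAPS replacement.
CURRENT_RULE = (re.compile(r'\bUNDANG[\s-]*UNDANG\b', re.IGNORECASE), 'UNDANG-UNDANG')


def apply_rule(text: str) -> str:
    """Apply the single (pattern, replacement) rule to text."""
    pattern, replacement = CURRENT_RULE
    return pattern.sub(replacement, text)


# Body text that is already correctly cased and hyphenated (made-up sample).
body = "sebagaimana dimaksud dalam Undang-Undang ini"
print(apply_rule(body))  # prints: sebagaimana dimaksud dalam UNDANG-UNDANG ini
```

The input needed no correction at all, yet the rule rewrites it, because re.IGNORECASE lets the pattern match every casing and the replacement string is fixed.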

Proposed Fix

Change the UNDANG-UNDANG pattern so that it normalizes only the hyphen and spacing and leaves the case as written. Either remove re.IGNORECASE from this pattern and add one case-sensitive pattern per casing, or keep re.IGNORECASE and use a replacement that preserves the case of the matched text.
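One possible form of the "smarter replacement" is a single rule that captures both halves and re-emits them unchanged around a hyphen. This is a sketch under assumptions, not the project's implementation: `UNDANG_RULE` and `normalize_undang` are hypothetical names, and the separator class `[\s-]+` (one or more spaces or hyphens) is an illustrative choice.

```python
import re

# Hypothetical rule: match any casing, but rebuild the result from the
# captured halves so the original case survives.
UNDANG_RULE = (
    re.compile(r'\b(undang)[\s-]+(undang)\b', re.IGNORECASE),
    r'\1-\2',  # backreferences re-emit each half exactly as matched
)


def normalize_undang(text: str) -> str:
    """Normalize the separator between the two halves; leave case untouched."""
    pattern, replacement = UNDANG_RULE
    return pattern.sub(replacement, text)


for sample in ("UNDANG UNDANG", "Undang - Undang", "undang- undang", "Undang-undang"):
    print(repr(sample), "->", repr(normalize_undang(sample)))
```

Because the replacement is still a plain string, the rule keeps the same (pattern, replacement) tuple shape as the existing entry. It also covers mixed casings such as 'Undang-undang', and the trailing \b keeps it from touching 'perundang-undangan'.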

Code Before

    (re.compile(r'\bUNDANG[\s-]*UNDANG\b', re.IGNORECASE), 'UNDANG-UNDANG'),

Code After

    # Normalize broken hyphenation/spacing without changing case.
    # One case-sensitive pattern per casing, so each replacement keeps the case of its match.
    # ALL CAPS (titles/headers)
    (re.compile(r'\bUNDANG[\s-]+UNDANG\b'), 'UNDANG-UNDANG'),
    # Title case (body text)
    (re.compile(r'\bUndang[\s-]+Undang\b'), 'Undang-Undang'),
    # Lowercase
    (re.compile(r'\bundang[\s-]+undang\b'), 'undang-undang'),

Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 4 parser feedback entries.
