[Parser] Missing OCR correction for dropped characters like 'p na' -> 'pidana' #18

@ilhamfp

Parser Improvement: Missing OCR correction for dropped characters like 'p na' -> 'pidana'

Severity: MEDIUM
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 1

Current Behavior

correct_ocr_errors does not correct common Indonesian legal terms from which OCR has dropped characters (e.g., 'p na' for 'pidana', or truncated fragments such as 'ket'). 'pidana' is extremely common in legal text, so its OCR corruption 'p na' (missing 'ida') should be caught.

Proposed Fix

Add patterns for common legal terms where OCR drops middle characters. 'pidana' is the highest-priority term since it appears hundreds of times in criminal law documents. Use word boundary matching to avoid false positives.
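
To illustrate why the word boundaries matter, here is a minimal sketch comparing the same expression with and without `\b` anchors. The sample sentence and the variable names are made up for illustration and are not taken from the parser feedback.

```python
import re

# Same expression with and without \b anchors (illustrative only).
unanchored = re.compile(r'p\s+na', re.IGNORECASE)
anchored = re.compile(r'\bp\s+na\b', re.IGNORECASE)

# Made-up sample: "p na" occurs across two valid words ("tetap nama").
sample = "tetap nama terdakwa"

# Without anchors the valid words are merged into garbage.
unanchored_result = unanchored.sub('pidana', sample)

# With anchors the text is left alone, because 'p' is not at a word start
# and 'na' is not at a word end.
anchored_result = anchored.sub('pidana', sample)

# A genuine dropped-character case is still corrected.
corrected = anchored.sub('pidana', 'ancaman p na penjara')
```

The anchored pattern only fires when 'p' and 'na' each stand alone as separate tokens, which is the shape of the OCR corruption described above.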

Code Before

    # Common word-level OCR errors in Indonesian legal text
    (re.compile(r'\bFRESIDEN\b', re.IGNORECASE), 'PRESIDEN'),     # P->F OCR confusion

Code After

    # Common word-level OCR errors in Indonesian legal text
    # Dropped characters in common legal terms
    (re.compile(r'\bp\s+na\b', re.IGNORECASE), 'pidana'),  # p na -> pidana (dropped 'ida')
    (re.compile(r'\bpida\s+na\b', re.IGNORECASE), 'pidana'),  # pida na -> pidana
    (re.compile(r'\bFRESIDEN\b', re.IGNORECASE), 'PRESIDEN'),     # P->F OCR confusion
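
One caveat with the patterns above: because they use `re.IGNORECASE` with a fixed lowercase replacement, an uppercase heading such as 'KETENTUAN P NA' would become 'KETENTUAN pidana'. If that matters, a case-preserving replacement could look like the sketch below. `match_case` and `fix_pidana` are hypothetical helpers, not part of ocr_correct.py, and the sketch assumes `correct_ocr_errors` passes the replacement straight to `pattern.sub`, which accepts a callable.

```python
import re

def match_case(replacement: str):
    """Return a re.sub callback that copies the matched text's casing style."""
    def _repl(m: re.Match) -> str:
        found = m.group(0)
        if found.isupper():          # "P NA" -> "PIDANA"
            return replacement.upper()
        if found[0].isupper():       # "P na" -> "Pidana"
            return replacement.capitalize()
        return replacement.lower()   # "p na" -> "pidana"
    return _repl

# The two dropped-character patterns proposed in this issue.
PIDANA_PATTERNS = [
    re.compile(r'\bp\s+na\b', re.IGNORECASE),
    re.compile(r'\bpida\s+na\b', re.IGNORECASE),
]

def fix_pidana(text: str) -> str:
    """Apply both patterns in order, preserving the casing of each match."""
    for pattern in PIDANA_PATTERNS:
        text = pattern.sub(match_case('pidana'), text)
    return text
```

For example, `fix_pidana("KETENTUAN P NA")` keeps the heading in uppercase, while `fix_pidana("tindak p na")` yields the lowercase form.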

Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 5 parser feedback entries.
