[Parser] OCR garbled text (random character sequences) not filtered out #10

@ilhamfp

Description

Parser Improvement: OCR garbled text (random character sequences) not filtered out

Severity: HIGH
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 2

Current Behavior

Garbled OCR artifacts like 'nrrFF[iriN]' pass through unfiltered, appearing in the middle of legal text and causing duplicate content around page boundaries.

Proposed Fix

Add a pattern to detect and remove lines consisting primarily of random character sequences: lines with a high density of uppercase consonants and brackets, and no recognizable Indonesian words. Also remove Cyrillic and other non-ASCII characters that are clearly OCR misreads of graphical elements.
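
The density heuristic described above could be prototyped separately from the regex list. The following is only a rough sketch under stated assumptions: `looks_garbled`, `KNOWN_WORDS`, `NOISE_CHARS`, and the 0.3 threshold are all hypothetical and are not part of ocr_correct.py, and a real check would need a proper Indonesian word list.

```python
import re

# Hypothetical, tiny vocabulary used only for illustration; a real check
# would need a proper Indonesian word list.
KNOWN_WORDS = {"bab", "pasal", "ayat", "undang", "dalam", "dan", "yang", "ini", "tentang"}

# Characters treated as "noise": uppercase consonants and brackets.
NOISE_CHARS = set("BCDFGHJKLMNPQRSTVWXYZ[]()")


def looks_garbled(line, known_words=KNOWN_WORDS, threshold=0.3):
    """Return True if a line looks like OCR gibberish (assumed heuristic)."""
    stripped = line.strip()
    if not stripped:
        return False
    # Any recognizable word means the line is kept.
    words = re.findall(r"[A-Za-z]+", stripped.lower())
    if any(word in known_words for word in words):
        return False
    # Otherwise flag the line when noise characters reach the threshold share.
    noise = sum(1 for ch in stripped if ch in NOISE_CHARS)
    return noise / len(stripped) >= threshold


print(looks_garbled("nrrFF[iriN]"))              # True
print(looks_garbled("Dalam Undang-Undang ini"))  # False
```

The vocabulary check runs first so that short uppercase headings such as "BAB II" are kept as long as one of their words is known; without a word list of reasonable size this heuristic would likely flag legitimate headings.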

Code Before

    # Common scanner artifacts
    (re.compile(r'^[;,.]$', re.MULTILINE), ''),  # Lone punctuation on a line
    (re.compile(r'^\s*[-_]{3,}\s*$', re.MULTILINE), ''),  # Horizontal rules from scan lines
]

Code After

    # Common scanner artifacts
    (re.compile(r'^[;,.]$', re.MULTILINE), ''),  # Lone punctuation on a line
    (re.compile(r'^\s*[-_]{3,}\s*$', re.MULTILINE), ''),  # Horizontal rules from scan lines

    # Garbled OCR lines: uppercase runs adjacent to brackets, e.g. 'nrrFF[iriN]'
    # Note: bracket-free fragments such as 'EtrN' or ';*trE' are not matched by these patterns
    (re.compile(r'^\s*[;*]?[a-zA-Z]*(?:[A-Z]{2,}[\[\]()]+[a-zA-Z]*|[\[\]()]+[A-Z]{2,})[a-zA-Z\[\]()]*\s*$', re.MULTILINE), ''),
    (re.compile(r'^\s*\S*[\[\]][A-Za-z]+[\[\]]\S*\s*$', re.MULTILINE), ''),  # Lines with bracketed gibberish

    # Cyrillic characters and the Œ/œ ligatures from OCR misreading graphics/signatures
    # e.g. Д, Ж, Ѽ, Œ appearing in place of dates, numbers, or logos
    # Strips the matching characters only; the rest of the line is kept
    (re.compile(r'[\u0400-\u052F\u0152\u0153]'), ''),
]
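
To check what the proposed patterns actually catch, they can be run against a few sample strings. This is a minimal sketch, not the real correct_ocr_errors: `PATTERNS` and `apply_patterns` are hypothetical stand-ins, the two line-level regexes are copied from the proposal, and the Cyrillic ranges are written as the single equivalent span \u0400-\u052F.

```python
import re

# Hypothetical stand-in for the pattern list in ocr_correct.py.
PATTERNS = [
    # Uppercase runs adjacent to brackets, whole line
    (re.compile(r'^\s*[;*]?[a-zA-Z]*(?:[A-Z]{2,}[\[\]()]+[a-zA-Z]*|[\[\]()]+[A-Z]{2,})[a-zA-Z\[\]()]*\s*$', re.MULTILINE), ''),
    # Lines with bracketed gibberish
    (re.compile(r'^\s*\S*[\[\]][A-Za-z]+[\[\]]\S*\s*$', re.MULTILINE), ''),
    # Cyrillic characters and the Œ/œ ligatures (characters only, not whole lines)
    (re.compile(r'[\u0400-\u052F\u0152\u0153]'), ''),
]


def apply_patterns(text):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


sample = "Pasal 1\nnrrFF[iriN]\nDalam Undang-Undang ini\n"
print(repr(apply_patterns(sample)))            # 'Pasal 1\n\nDalam Undang-Undang ini\n'
print(repr(apply_patterns("Tanggal Д 2020")))  # 'Tanggal  2020'
print(repr(apply_patterns("EtrN")))            # 'EtrN' (no brackets, so left unchanged)
```

Two things are worth noting from this run. The garbled line is emptied but its newline remains, so a blank line is left behind. The second pattern also empties a line that consists solely of a bracketed word, for example `[Pasal]`, so it may be worth checking the corpus for legitimate bracketed lines before merging.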

Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 4 parser feedback entries.
