[Parser] Double periods in list items ('i..') not cleaned #19

@ilhamfp

Severity: LOW
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 1

Current Behavior

List items whose marker or line ends in a double period, such as 'i..', are not corrected. This happens when the OCR reads the list marker's period and a sentence-ending period as two consecutive periods, or when a period from the next line is merged in.
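
The behavior can be reproduced with a minimal sketch that runs only the two scanner-artifact rules quoted under "Code Before". The helper name `apply_existing_rules` and the sample text are hypothetical; the real `correct_ocr_errors` presumably applies more rules than these two.

```python
import re

# The two scanner-artifact rules quoted under "Code Before" in this issue.
EXISTING_RULES = [
    (re.compile(r'^[;,.]$', re.MULTILINE), ''),           # Lone punctuation on a line
    (re.compile(r'^\s*[-_]{3,}\s*$', re.MULTILINE), ''),  # Horizontal rules from scan lines
]


def apply_existing_rules(text):
    """Hypothetical helper: apply each (pattern, replacement) pair in order."""
    for pattern, replacement in EXISTING_RULES:
        text = pattern.sub(replacement, text)
    return text


# Hypothetical OCR output with a doubled period after the list marker.
sample = "h. first item.\ni.. second item.\n"

# Neither rule matches 'i..', so the text passes through unchanged.
print(apply_existing_rules(sample) == sample)
```

Neither pattern targets a period pair that follows other text on the same line, which is why the marker survives as 'i..'.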

Proposed Fix

Add two patterns that normalize double periods: one for periods that follow a single-letter list marker and one for periods at the end of a line. A single-letter list marker followed by '..' should become the marker with a single period, so 'i..' becomes 'i.'.

Code Before

    # Common scanner artifacts
    (re.compile(r'^[;,.]$', re.MULTILINE), ''),  # Lone punctuation on a line
    (re.compile(r'^\s*[-_]{3,}\s*$', re.MULTILINE), ''),  # Horizontal rules from scan lines

Code After

    # Common scanner artifacts
    (re.compile(r'^[;,.]$', re.MULTILINE), ''),  # Lone punctuation on a line
    (re.compile(r'^\s*[-_]{3,}\s*$', re.MULTILINE), ''),  # Horizontal rules from scan lines

    # Double period after list markers: 'i..' -> 'i.', 'a..' -> 'a.'
    (re.compile(r'^([a-z])\.\.(\s)', re.MULTILINE), r'\1.\2'),
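    # Double period at end of line. [ \t]* (not \s*) keeps a following blank line intact,
    # and the lookbehind leaves '...' ellipses untouched.
    (re.compile(r'(?<!\.)\.\.[ \t]*$', re.MULTILINE), '.'),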
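
A minimal sketch for checking the double-period rules in isolation. The helper name `clean_double_periods` and the sample text are hypothetical. The end-of-line pattern here is a stricter variant assumed for this sketch: it uses `[ \t]*` rather than `\s*`, because under `re.MULTILINE` a trailing `\s*` can consume the newline before a blank line, and it adds a lookbehind so a three-period ellipsis is not shortened.

```python
import re

# Rules modeled on the "Code After" snippet. The second pattern is the stricter
# variant assumed for this sketch (see the lead-in), not necessarily the form
# that ships in ocr_correct.py.
DOUBLE_PERIOD_RULES = [
    # Double period after a single-letter list marker: 'i..' -> 'i.'
    (re.compile(r'^([a-z])\.\.(\s)', re.MULTILINE), r'\1.\2'),
    # Double period at end of line, ignoring '...' and keeping blank lines
    (re.compile(r'(?<!\.)\.\.[ \t]*$', re.MULTILINE), '.'),
]


def clean_double_periods(text):
    """Hypothetical helper: apply each (pattern, replacement) pair in order."""
    for pattern, replacement in DOUBLE_PERIOD_RULES:
        text = pattern.sub(replacement, text)
    return text


# Hypothetical OCR output covering the marker case, the end-of-line case,
# a blank line after a doubled period, and an ellipsis.
sample = (
    "h. first item..\n"
    "\n"
    "i.. second item.\n"
    "and so on...\n"
)

print(clean_double_periods(sample))
```

With this sample, the marker 'i..' and the trailing 'item..' are each reduced to a single period, while the blank line and the ellipsis are preserved. Indented markers and multi-letter markers such as 'ii..' are not matched by the list-marker rule as written.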

Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 5 parser feedback entries.
