[Parser] l/1 confusion in parenthesized numbers not corrected #11

@ilhamfp

Parser Improvement: l/1 confusion in parenthesized numbers not corrected

Severity: HIGH
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 1

Current Behavior

OCR misreads '(1)' as '(l)' and '(2)' as 'l2l'. The existing patterns correct l/1 confusion only after 'Pasal' or next to other digits (l23, 2l3). They do not cover parenthesized ayat references such as '(l)' or 'l2l', which are common in article cross-references and ayat numbering.
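
The gap can be reproduced with a minimal sketch like the one below. It copies the six patterns from "Code Before" and assumes they are applied in list order with `re.sub`; the names `EXISTING_PATTERNS` and `apply_patterns` are hypothetical stand-ins, since the body of `correct_ocr_errors` is not shown in this issue.

```python
import re

# Patterns copied from "Code Before". The list name is hypothetical.
EXISTING_PATTERNS = [
    (re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'),
    (re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'),
    (re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'),
    (re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'),
    (re.compile(r'(?<=\s)l(?=\d{2,})'), '1'),
    (re.compile(r'(?<=\d)l(?=\d)'), '1'),
]


def apply_patterns(text, patterns=EXISTING_PATTERNS):
    """Apply each (pattern, replacement) pair in order (assumed behaviour)."""
    for pattern, replacement in patterns:
        text = pattern.sub(replacement, text)
    return text


# 'Pasal' contexts are corrected by the existing patterns...
print(apply_patterns('Pasal l3'))
# ...but parenthesized ayat markers pass through unchanged.
print(apply_patterns('(l) Setiap orang'))
print(apply_patterns('l2l Hukum'))
```

With these inputs, only the 'Pasal l3' case is rewritten; the two ayat markers come back exactly as they went in.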

Proposed Fix

Add patterns to fix l/1 confusion in parenthesized number contexts: (l) -> (1), l2l -> (2), and similar cases where 'l' stands in for the digit 1 inside parentheses or for the parentheses themselves.

Code Before

    # Letter-digit confusion
    (re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'),  # Standalone Pasal l -> Pasal 1
    (re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'),  # Pasal l3 -> Pasal 13
    (re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'),  # 1O -> 10, 9O -> 90
    (re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'),  # Pasal 1O -> Pasal 10
    (re.compile(r'(?<=\s)l(?=\d{2,})'), '1'),  # l23 -> 123
    (re.compile(r'(?<=\d)l(?=\d)'), '1'),  # 2l3 -> 213

Code After

    # Letter-digit confusion
    (re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'),  # Standalone Pasal l -> Pasal 1
    (re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'),  # Pasal l3 -> Pasal 13
    (re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'),  # 1O -> 10, 9O -> 90
    (re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'),  # Pasal 1O -> Pasal 10
    (re.compile(r'(?<=\s)l(?=\d{2,})'), '1'),  # l23 -> 123
    (re.compile(r'(?<=\d)l(?=\d)'), '1'),  # 2l3 -> 213

    # Parenthesized number l/1 confusion: (l) -> (1), (l2) -> (12)
    (re.compile(r'\(l\)'), '(1)'),  # (l) -> (1)
    (re.compile(r'\(l(\d+)\)'), r'(1\1)'),  # (l2) -> (12)
    (re.compile(r'\((\d+)l\)'), r'(\g<1>1)'),  # (2l) -> (21)
    (re.compile(r'\((\d*)l(\d+)\)'), lambda m: '(' + (m.group(1) or '') + '1' + m.group(2) + ')'),  # (2l3) -> (213); a literal (l2l) is not matched
    # Standalone l<digit>l pattern at start of line (ayat marker): l2l -> (2)
    (re.compile(r'^l(\d+)l(?=\s)', re.MULTILINE), r'(\1)'),  # l2l Hukum -> (2) Hukum
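
As a quick check of the proposed additions, the sketch below runs only the five new patterns against a few sample strings. It again assumes sequential `re.sub` application; `PAREN_PATTERNS` and `apply_patterns` are hypothetical names, and the sample inputs are illustrative rather than taken from the parser feedback.

```python
import re

# The five patterns added in "Code After", copied as-is. The list name is hypothetical.
PAREN_PATTERNS = [
    (re.compile(r'\(l\)'), '(1)'),
    (re.compile(r'\(l(\d+)\)'), r'(1\1)'),
    (re.compile(r'\((\d+)l\)'), r'(\g<1>1)'),
    (re.compile(r'\((\d*)l(\d+)\)'),
     lambda m: '(' + (m.group(1) or '') + '1' + m.group(2) + ')'),
    (re.compile(r'^l(\d+)l(?=\s)', re.MULTILINE), r'(\1)'),
]


def apply_patterns(text, patterns=PAREN_PATTERNS):
    """Apply each (pattern, replacement) pair in order (assumed behaviour)."""
    for pattern, replacement in patterns:
        text = pattern.sub(replacement, text)
    return text


samples = [
    '(l) Setiap orang',        # 'l' as the digit 1 inside parentheses
    '(l2)',                    # leading 'l' before digits
    '(2l)',                    # trailing 'l' after digits
    '(2l3)',                   # 'l' between digits
    'ayat (l) dan\nl2l Hukum', # line-initial ayat marker, 'l' as parentheses
    '(l2l)',                   # not covered by any of the five patterns
]
for sample in samples:
    print(repr(sample), '->', repr(apply_patterns(sample)))
```

In this sketch the last sample is returned unchanged, so if a literal '(l2l)' does occur in OCR output it would need a pattern of its own.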

Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 4 parser feedback entries.
