[Parser] l/1 confusion not corrected in ayat references like '(l)' and 'l2l' #16

@ilhamfp

Parser Improvement: l/1 confusion not corrected in ayat references like '(l)' and 'l2l'

Severity: HIGH
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 2

Current Behavior

The OCR corrector rewrites 'Pasal l' to 'Pasal 1' and fixes digit-l-digit patterns such as '2l3' -> '213'. It does not handle a lowercase 'l' misread for the digit 1 inside a parenthesized reference ('(l)' should be '(1)'), nor cases such as 'l2l', which should be '(2)', where the parentheses themselves are misread as 'l'.
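The gap described above can be reproduced with a small sketch. The six rules are copied from the "Code Before" listing; the `apply_rules` helper and the sample strings are hypothetical, and the assumption that `correct_ocr_errors` applies the rules one after another in list order is not confirmed by this issue.

```python
import re

# Rules copied verbatim from the "Code Before" listing in this issue.
EXISTING_RULES = [
    (re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'),
    (re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'),
    (re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'),
    (re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'),
    (re.compile(r'(?<=\s)l(?=\d{2,})'), '1'),
    (re.compile(r'(?<=\d)l(?=\d)'), '1'),
]


def apply_rules(text, rules):
    """Hypothetical helper: apply each (pattern, replacement) pair in list order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


# Cases the existing rules already correct.
print(apply_rules("Pasal l", EXISTING_RULES))   # Pasal 1
print(apply_rules("Pasal l3", EXISTING_RULES))  # Pasal 13

# Cases from this issue: both come back unchanged.
print(apply_rules("ayat (l)", EXISTING_RULES))          # ayat (l)
print(apply_rules("ayat l2l huruf a", EXISTING_RULES))  # ayat l2l huruf a
```

Neither '(l)' nor 'l2l' matches any existing rule: the 'l' in '(l)' follows '(' rather than whitespace or a digit, and the leading 'l' in 'l2l' is followed by only one digit.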

Proposed Fix

Add patterns for three cases: (1) '(l)' -> '(1)', which also covers 'ayat (l)' references; (2) 'l' followed by digits inside parentheses, e.g. '(l2)' -> '(12)'; and (3) 'l' standing in for both the opening and closing parenthesis around digits, e.g. 'l2l' -> '(2)'. These are systematic OCR errors in which a lowercase L is confused with both the digit 1 and parenthesis characters.

Code Before

    (re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'),  # Standalone Pasal l -> Pasal 1
    (re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'),  # Pasal l3 -> Pasal 13
    (re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'),  # 1O -> 10, 9O -> 90
    (re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'),  # Pasal 1O -> Pasal 10
    (re.compile(r'(?<=\s)l(?=\d{2,})'), '1'),  # l23 -> 123
    (re.compile(r'(?<=\d)l(?=\d)'), '1'),  # 2l3 -> 213

Code After

    (re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'),  # Standalone Pasal l -> Pasal 1
    (re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'),  # Pasal l3 -> Pasal 13
    (re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'),  # 1O -> 10, 9O -> 90
    (re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'),  # Pasal 1O -> Pasal 10
    # l used as parenthesis around digits: l2l -> (2), l12l -> (12)
    # Placed before the l23 -> 123 rule: if rules are applied in list order, that rule would otherwise turn ' l12l' into ' 112l' first
    (re.compile(r'(?:^|(?<=\s))l(\d+)l(?=\s|$|[^a-zA-Z])', re.MULTILINE), r'(\1)'),
    (re.compile(r'(?<=\s)l(?=\d{2,})'), '1'),  # l23 -> 123
    (re.compile(r'(?<=\d)l(?=\d)'), '1'),  # 2l3 -> 213

    # l/1 confusion in parenthesized references: (l) -> (1), ayat (l) -> ayat (1)
    (re.compile(r'\(l\)'), '(1)'),  # (l) -> (1)
    (re.compile(r'\(l(\d+)\)'), r'(1\1)'),  # (l2) -> (12)
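
A sketch for checking the new rules against the examples in this issue. It assumes the rules are applied sequentially in list order, which this issue does not confirm. The `apply_rules` helper, the list names, and the sample strings are hypothetical. Under that assumption, rule order matters: when the `l23 -> 123` rule runs before the parenthesis-wrapper rule, ' l12l' becomes ' 112l' and the wrapper rule no longer matches, so the sketch compares both orders.

```python
import re

# Rules taken from the listings in this issue, grouped so the order can be varied.
PASAL_AND_ZERO_RULES = [
    (re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'),
    (re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'),
    (re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'),
    (re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'),
]
L_TO_ONE_RULES = [
    (re.compile(r'(?<=\s)l(?=\d{2,})'), '1'),  # l23 -> 123
    (re.compile(r'(?<=\d)l(?=\d)'), '1'),      # 2l3 -> 213
]
PAREN_DIGIT_RULES = [
    (re.compile(r'\(l\)'), '(1)'),             # (l) -> (1)
    (re.compile(r'\(l(\d+)\)'), r'(1\1)'),     # (l2) -> (12)
]
WRAPPER_RULE = [
    # l used as both parentheses: l2l -> (2)
    (re.compile(r'(?:^|(?<=\s))l(\d+)l(?=\s|$|[^a-zA-Z])', re.MULTILINE), r'(\1)'),
]

# Wrapper rule last versus wrapper rule ahead of the l -> 1 rules.
WRAPPER_LAST = PASAL_AND_ZERO_RULES + L_TO_ONE_RULES + PAREN_DIGIT_RULES + WRAPPER_RULE
WRAPPER_FIRST = PASAL_AND_ZERO_RULES + WRAPPER_RULE + L_TO_ONE_RULES + PAREN_DIGIT_RULES


def apply_rules(text, rules):
    """Hypothetical helper: apply each (pattern, replacement) pair in list order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


for sample in ["ayat (l)", "ayat (l2)", "l2l Dalam hal", "ayat l12l huruf a"]:
    print(repr(sample),
          "| wrapper last:", repr(apply_rules(sample, WRAPPER_LAST)),
          "| wrapper first:", repr(apply_rules(sample, WRAPPER_FIRST)))
```

The single-digit cases behave the same in both orders. Only multi-digit wrappers such as 'l12l' depend on the wrapper rule running before the `l23 -> 123` rule.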

Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 5 parser feedback entries.
