Parser Improvement: l/1 confusion in parenthesized numbers not corrected
Severity: HIGH
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 1
Current Behavior
OCR misreads '(1)' as '(l)' and '(2)' as 'l2l'. The existing patterns correct l/1 confusion only after 'Pasal' or next to other digits. They do not cover parenthesized ayat references such as '(l)' or 'l2l', which are common in article cross-references and ayat numbering.
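The gap can be reproduced with a minimal sketch. It assumes the rules are applied in list order with `pattern.sub`; the names `EXISTING_RULES` and `apply_rules` are hypothetical, and the real `correct_ocr_errors` may be organized differently.

```python
import re

# Existing letter-digit rules, copied from the current pattern list.
EXISTING_RULES = [
    (re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'),
    (re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'),
    (re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'),
    (re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'),
    (re.compile(r'(?<=\s)l(?=\d{2,})'), '1'),
    (re.compile(r'(?<=\d)l(?=\d)'), '1'),
]

def apply_rules(text, rules):
    """Hypothetical helper: apply each (pattern, replacement) pair in order."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text

sample = "Pasal l\nl2l Hukum sebagaimana dimaksud pada ayat (l)."
result = apply_rules(sample, EXISTING_RULES)
print(result)
```

Under these assumptions only the 'Pasal l' heading is corrected; the 'l2l' ayat marker and the '(l)' cross-reference pass through unchanged.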
Proposed Fix
Add patterns that correct l/1 confusion inside parenthesized numbers, e.g. (l) -> (1) and (l2) -> (12), and that restore parentheses misread as 'l' around a line-initial ayat number, e.g. l2l -> (2).
Code Before
# Letter-digit confusion
(re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'), # Standalone Pasal l -> Pasal 1
(re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'), # Pasal l3 -> Pasal 13
(re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'), # 1O -> 10, 9O -> 90
(re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'), # Pasal 1O -> Pasal 10
(re.compile(r'(?<=\s)l(?=\d{2,})'), '1'), # l23 -> 123
(re.compile(r'(?<=\d)l(?=\d)'), '1'), # 2l3 -> 213
Code After
# Letter-digit confusion
(re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'), # Standalone Pasal l -> Pasal 1
(re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'), # Pasal l3 -> Pasal 13
(re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'), # 1O -> 10, 9O -> 90
(re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'), # Pasal 1O -> Pasal 10
(re.compile(r'(?<=\s)l(?=\d{2,})'), '1'), # l23 -> 123
(re.compile(r'(?<=\d)l(?=\d)'), '1'), # 2l3 -> 213
# Parenthesized number l/1 confusion: (l) -> (1), (l2) -> (12)
(re.compile(r'\(l\)'), '(1)'), # (l) -> (1)
(re.compile(r'\(l(\d+)\)'), r'(1\1)'), # (l2) -> (12)
(re.compile(r'\((\d+)l\)'), r'(\g<1>1)'), # (2l) -> (21)
(re.compile(r'\((\d*)l(\d+)\)'), r'(\g<1>1\2)'), # (2l3) -> (213); also matches (l2) -> (12)
# Standalone l<digit>l pattern at start of line (ayat marker): l2l -> (2)
(re.compile(r'^l(\d+)l(?=\s)', re.MULTILINE), r'(\1)'), # l2l Hukum -> (2) Hukum
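The added rules can be exercised in isolation with a short sketch. It assumes sequential application via `pattern.sub`; `NEW_RULES`, `apply_rules`, and the sample strings are hypothetical, and only the single-`l` parenthesized rules and the line-initial marker rule are included.

```python
import re

# Subset of the added rules, copied from the proposed pattern list.
NEW_RULES = [
    (re.compile(r'\(l\)'), '(1)'),                           # (l)  -> (1)
    (re.compile(r'\(l(\d+)\)'), r'(1\1)'),                   # (l2) -> (12)
    (re.compile(r'\((\d+)l\)'), r'(\g<1>1)'),                # (2l) -> (21)
    (re.compile(r'^l(\d+)l(?=\s)', re.MULTILINE), r'(\1)'),  # l2l Hukum -> (2) Hukum
]

def apply_rules(text, rules):
    """Hypothetical helper: apply each (pattern, replacement) pair in order."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text

# Illustrative inputs mapped to the output each rule is meant to produce.
cases = {
    "ayat (l)": "ayat (1)",
    "ayat (l2)": "ayat (12)",
    "ayat (2l)": "ayat (21)",
    "l2l Hukum": "(2) Hukum",
}
results = {src: apply_rules(src, NEW_RULES) for src in cases}
for src, fixed in results.items():
    print(f"{src!r} -> {fixed!r}")
```

One ordering effect may be worth checking in the full list, assuming the rules run in the order shown: the earlier rule `(?<=\s)l(?=\d{2,})` fires first on a marker with two or more digits that follows a newline, so 'l12l' becomes '112l' before the line-initial rule can match.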
Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 4 parser feedback entries.