Parser Improvement: l/1 confusion not corrected in ayat references like '(l)' and 'l2l'
Severity: HIGH
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 2
Current Behavior
The OCR corrector already rewrites 'Pasal l' to 'Pasal 1' and fixes digit-l-digit patterns such as '2l3' -> '213'. It does not correct a lowercase 'l' inside a parenthesized ayat number ('(l)' should become '(1)'), nor the case where the parentheses themselves are misread as 'l' ('l2l' should become '(2)').
Proposed Fix
Add patterns for: (1) '(l)', where the 'l' should be the digit 1, which also covers 'ayat (l)' references; (2) '(l' followed by digits, e.g. '(l2)' -> '(12)'; and (3) 'l' standing in for the opening and closing parentheses around digits, e.g. 'l2l' -> '(2)'. These are systematic OCR errors in which lowercase L is confused with both the digit 1 and the parenthesis characters.
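A minimal sketch of the proposed rules in isolation, using the same regular expressions as the Code After section. The helper name fix_parens, the list name PAREN_RULES, and the sample strings are illustrative assumptions and are not taken from ocr_correct.py.

```python
import re

# The proposed rules, isolated from the rest of the corrector.
PAREN_RULES = [
    (re.compile(r'\(l\)'), '(1)'),          # (l) -> (1)
    (re.compile(r'\(l(\d+)\)'), r'(1\1)'),  # (l2) -> (12)
    # l2l -> (2): 'l' misread for both parentheses around digits
    (re.compile(r'(?:^|(?<=\s))l(\d+)l(?=\s|$|[^a-zA-Z])', re.MULTILINE), r'(\1)'),
]


def fix_parens(text: str) -> str:
    """Hypothetical helper: apply each rule once, in list order."""
    for pattern, replacement in PAREN_RULES:
        text = pattern.sub(replacement, text)
    return text


print(fix_parens("sebagaimana dimaksud pada ayat (l)"))  # ... ayat (1)
print(fix_parens("l2l Ketentuan lebih lanjut"))          # (2) Ketentuan lebih lanjut
print(fix_parens("ayat (l2)"))                           # ayat (12)
print(fix_parens("hal lain"))                            # unchanged: no digits after 'l'
```

Ordinary words that begin with 'l' are left alone because every rule requires either a literal parenthesis or at least one digit next to the 'l'.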
Code Before
(re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'), # Standalone Pasal l -> Pasal 1
(re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'), # Pasal l3 -> Pasal 13
(re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'), # 1O -> 10, 9O -> 90
(re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'), # Pasal 1O -> Pasal 10
(re.compile(r'(?<=\s)l(?=\d{2,})'), '1'), # l23 -> 123
(re.compile(r'(?<=\d)l(?=\d)'), '1'), # 2l3 -> 213
Code After
(re.compile(r'^(Pasal)[ \t]+l\s*$', re.MULTILINE), r'\1 1'), # Standalone Pasal l -> Pasal 1
(re.compile(r'(?<=Pasal\s)[lI](\d+)', re.MULTILINE), r'1\1'), # Pasal l3 -> Pasal 13
(re.compile(r'(\d)O(?=\s|$|\n)'), lambda m: m.group(1) + '0'), # 1O -> 10, 9O -> 90
(re.compile(r'(?<=Pasal\s)(\d+)O\b'), lambda m: m.group(1) + '0'), # Pasal 1O -> Pasal 10
# l used as parenthesis around digits: l2l -> (2), l12l -> (12)
# Placed before the l23 -> 123 rule: if rules run in list order, that rule would turn ' l12l' into ' 112l' first
(re.compile(r'(?:^|(?<=\s))l(\d+)l(?=\s|$|[^a-zA-Z])', re.MULTILINE), r'(\1)'),
(re.compile(r'(?<=\s)l(?=\d{2,})'), '1'), # l23 -> 123
(re.compile(r'(?<=\d)l(?=\d)'), '1'), # 2l3 -> 213
# l/1 confusion in parenthesized references: (l) -> (1), ayat (l) -> ayat (1)
(re.compile(r'\(l\)'), '(1)'), # (l) -> (1)
(re.compile(r'\(l(\d+)\)'), r'(1\1)'), # (l2) -> (12)
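One interaction is worth checking. Assuming correct_ocr_errors applies its rules one after another in list order (the loop itself is not shown here), the relative order of the 'l23 -> 123' rule and the l-as-parenthesis rule matters for multi-digit references such as 'l12l' that follow whitespace. The sketch below compares the two orderings; apply_rules, L_TO_ONE and L_AS_PAREN are illustrative names, not identifiers from ocr_correct.py.

```python
import re

# l23 -> 123 (lowercase l before two or more digits, after whitespace)
L_TO_ONE = (re.compile(r'(?<=\s)l(?=\d{2,})'), '1')
# l12l -> (12) (lowercase l misread for both parentheses)
L_AS_PAREN = (
    re.compile(r'(?:^|(?<=\s))l(\d+)l(?=\s|$|[^a-zA-Z])', re.MULTILINE),
    r'(\1)',
)


def apply_rules(text, rules):
    """Apply (pattern, replacement) pairs sequentially, in list order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


sample = "ayat l12l dan l23"

# Digit rule first: the leading 'l' of 'l12l' is consumed before the
# parenthesis rule can see it.
digit_first = apply_rules(sample, [L_TO_ONE, L_AS_PAREN])
print(digit_first)  # ayat 112l dan 123

# Parenthesis rule first: both references come out as intended.
paren_first = apply_rules(sample, [L_AS_PAREN, L_TO_ONE])
print(paren_first)  # ayat (12) dan 123
```

Under that assumption, running the parenthesis rule before the 'l23 -> 123' rule is what makes the 'l12l -> (12)' case work when the reference appears mid-line.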
Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 5 parser feedback entries.