Parser Improvement: OCR garbled text (random character sequences) not filtered out
Severity: HIGH
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 2
Current Behavior
Garbled OCR artifacts like 'nrrFF[iriN]' pass through unfiltered, appearing in the middle of legal text and causing duplicate content around page boundaries.
Proposed Fix
Add patterns that detect and remove lines consisting primarily of random character sequences: single-token lines that mix runs of uppercase letters with brackets and contain no recognizable Indonesian words. Also strip Cyrillic and other stray non-Latin characters (such as Œ) that are clearly OCR misreads of graphical elements.
Code Before
# Common scanner artifacts
(re.compile(r'^[;,.]$', re.MULTILINE), ''), # Lone punctuation on a line
(re.compile(r'^\s*[-_]{3,}\s*$', re.MULTILINE), ''), # Horizontal rules from scan lines
]
Code After
# Common scanner artifacts
(re.compile(r'^[;,.]$', re.MULTILINE), ''), # Lone punctuation on a line
(re.compile(r'^\s*[-_]{3,}\s*$', re.MULTILINE), ''), # Horizontal rules from scan lines
# Garbled OCR lines: runs of uppercase letters mixed with brackets, e.g. 'nrrFF[iriN]'
# (bracket-free gibberish such as 'EtrN' or ';*trE' is NOT matched by the two patterns below)
(re.compile(r'^\s*[;*]?[a-zA-Z]*(?:[A-Z]{2,}[\[\]()]+[a-zA-Z]*|[\[\]()]+[A-Z]{2,})[a-zA-Z\[\]()]*\s*$', re.MULTILINE), ''),
(re.compile(r'^\s*\S*[\[\]][A-Za-z]+[\[\]]\S*\s*$', re.MULTILINE), ''), # Lines with bracketed gibberish
# Cyrillic and other non-Latin characters from OCR misreading graphics/signatures
# e.g. Д, Ж, Ѽ, Œ used instead of dates, numbers, or logos
(re.compile(r'[\u0400-\u052F\u0152\u0153]'), ''),  # Cyrillic + Cyrillic Supplement, plus Œ/œ
]
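The effect of the change can be checked with a small, self-contained sketch. The regexes are copied from the two listings above; the names BEFORE, AFTER, apply_patterns and the sample text are hypothetical and only stand in for whatever correct_ocr_errors does with its pattern list, which this report does not show.

```python
import re

# Patterns copied from "Code Before" (the list name is an assumption).
BEFORE = [
    (re.compile(r'^[;,.]$', re.MULTILINE), ''),            # Lone punctuation on a line
    (re.compile(r'^\s*[-_]{3,}\s*$', re.MULTILINE), ''),   # Horizontal rules from scan lines
]

# Patterns added in "Code After".
AFTER = BEFORE + [
    (re.compile(r'^\s*[;*]?[a-zA-Z]*(?:[A-Z]{2,}[\[\]()]+[a-zA-Z]*|[\[\]()]+[A-Z]{2,})[a-zA-Z\[\]()]*\s*$', re.MULTILINE), ''),
    (re.compile(r'^\s*\S*[\[\]][A-Za-z]+[\[\]]\S*\s*$', re.MULTILINE), ''),
    (re.compile(r'[\u0400-\u04FF\u0500-\u052F\u0152\u0153\u0460-\u047F]'), ''),
]


def apply_patterns(text, patterns):
    """Hypothetical helper: apply each (regex, replacement) pair in order."""
    for pattern, replacement in patterns:
        text = pattern.sub(replacement, text)
    return text


# Made-up sample: one bracketed garble line, one stray Cyrillic letter (U+0414),
# and one bracket-free garble line.
sample = "Pasal 5\nnrrFF[iriN]\nMenteri menetapkan \u0414pedoman.\nEtrN\n"

unchanged = apply_patterns(sample, BEFORE)
cleaned = apply_patterns(sample, AFTER)

print(repr(unchanged))
print(repr(cleaned))
```

On this sample the old patterns leave the text untouched, while the new ones blank out the 'nrrFF[iriN]' line (an empty line remains) and drop the Cyrillic letter. The 'EtrN' line survives, because both garbled-line patterns require at least one bracket.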
Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 4 parser feedback entries.