[Parser] OCR garbled text (random character sequences) not filtered out

## Parser Improvement: OCR garbled text (random character sequences) not filtered out

**Severity:** HIGH
**File:** `ocr_correct.py`
**Function:** `correct_ocr_errors`
**Estimated errors fixed:** 2

### Current Behavior

Garbled OCR artifacts like 'nrrFF[iriN]' pass through unfiltered, appearing in the middle of legal text and causing duplicate content around page boundaries.

### Proposed Fix

Add patterns to detect and remove lines consisting primarily of random character sequences: lines with a high density of uppercase consonants and brackets, and no recognizable Indonesian words. Also strip Cyrillic characters and other stray non-ASCII characters (such as Œ) that are clearly OCR misreads of graphical elements.
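The heuristic described above could be sketched roughly as follows. This is an illustration only, not the code in `ocr_correct.py`: the `looks_garbled` function, the `KNOWN_WORDS` list, and the `0.4` threshold are all hypothetical, and a real filter would need a proper Indonesian lexicon and a tuned cutoff.

```python
import re

# Hypothetical mini word list; a real filter would need a full Indonesian lexicon.
KNOWN_WORDS = frozenset({
    "pasal", "ayat", "undang", "dengan", "yang", "dan",
    "presiden", "republik", "indonesia",
})

UPPER_CONSONANTS = frozenset("BCDFGHJKLMNPQRSTVWXYZ")
BRACKETS = frozenset("[](){}")


def looks_garbled(line: str, threshold: float = 0.4) -> bool:
    """Return True if the line has no known word and is dense in
    uppercase consonants and brackets (threshold is an assumed cutoff)."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False  # blank lines are not treated as garbled

    # Any recognizable word protects the line from removal.
    words = re.findall(r"[A-Za-z]+", line)
    if any(w.lower() in KNOWN_WORDS for w in words):
        return False

    noisy = sum(1 for c in chars if c in UPPER_CONSONANTS or c in BRACKETS)
    return noisy / len(chars) >= threshold
```

The known-word check runs first so that all-caps legal headings, which are naturally dense in uppercase consonants, are kept as long as they contain at least one recognized word.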

### Code Before

```python
    # Common scanner artifacts
    (re.compile(r'^[;,.]$', re.MULTILINE), ''),  # Lone punctuation on a line
    (re.compile(r'^\s*[-_]{3,}\s*$', re.MULTILINE), ''),  # Horizontal rules from scan lines
]
```

### Code After

```python
    # Common scanner artifacts
    (re.compile(r'^[;,.]$', re.MULTILINE), ''),  # Lone punctuation on a line
    (re.compile(r'^\s*[-_]{3,}\s*$', re.MULTILINE), ''),  # Horizontal rules from scan lines

    # Garbled OCR lines: uppercase clusters fused with brackets
    # e.g. 'nrrFF[iriN]'; bracket-free fragments such as 'EtrN' or ';*trE' are not matched
    (re.compile(r'^\s*[;*]?[a-zA-Z]*(?:[A-Z]{2,}[\[\]()]+[a-zA-Z]*|[\[\]()]+[A-Z]{2,})[a-zA-Z\[\]()]*\s*$', re.MULTILINE), ''),
    (re.compile(r'^\s*\S*[\[\]][A-Za-z]+[\[\]]\S*\s*$', re.MULTILINE), ''),  # Lines with bracketed gibberish

    # Cyrillic characters and the Œ/œ ligature from OCR misreading graphics/signatures
    # e.g. Д, Ж, Ѽ, Œ appearing in place of dates, numbers, or logos
    (re.compile(r'[\u0400-\u052F\u0152\u0153]'), ''),
]
```
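To sanity-check the proposal before merging, the new patterns can be exercised in isolation. The sketch below assumes the patterns are applied in order with `re.sub`; the `NEW_PATTERNS` list and the `apply_patterns` helper are hypothetical names, since the body of `correct_ocr_errors` is not shown in this issue.

```python
import re

# The three proposed patterns. The last character class covers the same
# code points as the proposal, written as one merged range.
NEW_PATTERNS = [
    (re.compile(r'^\s*[;*]?[a-zA-Z]*(?:[A-Z]{2,}[\[\]()]+[a-zA-Z]*|[\[\]()]+[A-Z]{2,})[a-zA-Z\[\]()]*\s*$', re.MULTILINE), ''),
    (re.compile(r'^\s*\S*[\[\]][A-Za-z]+[\[\]]\S*\s*$', re.MULTILINE), ''),
    (re.compile(r'[\u0400-\u052F\u0152\u0153]'), ''),
]


def apply_patterns(text: str) -> str:
    """Apply each (pattern, replacement) pair in order (hypothetical helper)."""
    for pattern, replacement in NEW_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


sample = "Pasal 1\nnrrFF[iriN]\nDalam Undang-Undang ini"
print(repr(apply_patterns(sample)))
# 'Pasal 1\n\nDalam Undang-Undang ini'
```

Two behaviors are worth checking against real documents: a line consisting of a single legitimate bracketed token such as `[sic]` is also removed by the second pattern, and stripping a Cyrillic character from the middle of a line leaves the surrounding whitespace in place.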

---

_Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 4 parser feedback entries._

