[Parser] Page boundary causes duplicate list items with truncated text #12

## Parser Improvement: Page boundary causes duplicate list items with truncated text

**Severity:** HIGH
**File:** `extract_pymupdf.py`
**Function:** `_dedup_page_breaks`
**Estimated errors fixed:** 2

### Current Behavior

When a page break falls mid-sentence in a list item (e.g., 'a. keamanan' at the end of one page, then 'a. keamanan negara...' at the start of the next), overlap detection fails because the text is not an exact suffix match: the first page ends with truncated text and trailing artifacts, while the next page restarts the item cleanly. The result is duplicate list items with garbled text between them.
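The failure can be reproduced in isolation. The sketch below mirrors the suffix check from `_dedup_page_breaks` in a hypothetical helper, `has_exact_overlap`, and runs it on made-up sample text. The trailing `~` only stands in for an extraction artifact; the real artifacts are not shown in this issue.

```python
def has_exact_overlap(prev_text: str, next_page: str) -> bool:
    """Mirror the existing check: is a suffix of prev_text
    (11 to 200 characters long) also a prefix of next_page?"""
    max_check = min(200, len(prev_text), len(next_page))
    return any(
        next_page.startswith(prev_text[-length:])
        for length in range(max_check, 10, -1)
    )


# Hypothetical sample pages, not taken from a real document.
next_page = "a. keamanan negara;\nb. ketertiban umum;"
clean_prev = "Pasal 5\na. keamanan negara;"   # item repeated verbatim
truncated_prev = "Pasal 5\na. keamanan ~"     # item cut off, with an artifact

print(has_exact_overlap(clean_prev, next_page))      # True
print(has_exact_overlap(truncated_prev, next_page))  # False
```

In the second case every candidate suffix ends with the artifact, so none of them can be a prefix of the next page, and both copies of the item are kept.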

### Proposed Fix

Enhance page deduplication to detect partial line overlaps at page boundaries. When the last non-empty line of the previous page starts with a list marker (`a.`, `1.`, `(1)`, etc.), the first non-empty line of the next page starts with the same marker, and the next page's line is longer, treat it as a page-break duplicate: drop the truncated line from the previous page and keep the next page's version.
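As a minimal sketch of that rule on its own, separate from the joining logic, the check could look like the following. The helper names `boundary_lines` and `is_page_break_duplicate` are hypothetical and the sample strings are made up; only the marker pattern is taken from the proposed fix.

```python
import re

# Same marker pattern as the proposed fix: "a.", "1.", or "(1)".
LIST_MARKER = re.compile(r'^\s*(?:[a-z]\.|\d+\.|\(\d+\))\s*')


def boundary_lines(prev_page: str, next_page: str) -> tuple[str, str]:
    """Return (last non-blank line of prev_page, first non-blank line of next_page)."""
    last = next((ln.strip() for ln in reversed(prev_page.splitlines()) if ln.strip()), "")
    first = next((ln.strip() for ln in next_page.splitlines() if ln.strip()), "")
    return last, first


def is_page_break_duplicate(prev_page: str, next_page: str) -> bool:
    """True when both boundary lines share a list marker and the next page's line is longer."""
    last, first = boundary_lines(prev_page, next_page)
    last_m, first_m = LIST_MARKER.match(last), LIST_MARKER.match(first)
    return bool(
        last_m and first_m
        and last_m.group().strip() == first_m.group().strip()
        and len(first) > len(last)
    )


print(is_page_break_duplicate("Pasal 5\na. keamanan ~\n", "a. keamanan negara;\nb. ketertiban umum;"))  # True
print(is_page_break_duplicate("a. keamanan negara;", "b. ketertiban umum;"))  # False
```

The length comparison is what keeps an ordinary item, such as a fresh "a." list starting on the next page after a longer "a." item, from being discarded.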

### Code Before

```python
def _dedup_page_breaks(pages: list[str]) -> str:
    """Join pages while removing duplicated text at page boundaries."""
    if not pages:
        return ""
    result = pages[0]
    for page in pages[1:]:
        overlap = 0
        max_check = min(200, len(result), len(page))
        for length in range(max_check, 10, -1):
            suffix = result[-length:]
            if page.startswith(suffix):
                overlap = length
                break
        if overlap > 0:
            result += page[overlap:]
        else:
            result += '\n' + page
    return result
```

### Code After

```python
import re

# Leading list marker: "a.", "1.", or "(1)".
_LIST_MARKER_RE_DEDUP = re.compile(r'^\s*(?:[a-z]\.|\d+\.|\(\d+\))\s*')


def _dedup_page_breaks(pages: list[str]) -> str:
    """Join pages while removing duplicated text at page boundaries."""
    if not pages:
        return ""
    result = pages[0]
    for page in pages[1:]:
        # Exact overlap: a suffix of result (11 to 200 chars) repeats at the start of page.
        overlap = 0
        max_check = min(200, len(result), len(page))
        for length in range(max_check, 10, -1):
            suffix = result[-length:]
            if page.startswith(suffix):
                overlap = length
                break
        if overlap > 0:
            result += page[overlap:]
            continue
        # Partial duplicate: if the last non-blank line of result starts with
        # the same list marker as the first non-blank line of the next page,
        # the page broke mid-item.
        result_lines = result.rstrip().split('\n')
        last_line = result_lines[-1].strip()
        first_line = next((pl.strip() for pl in page.split('\n') if pl.strip()), '')
        last_m = _LIST_MARKER_RE_DEDUP.match(last_line)
        first_m = _LIST_MARKER_RE_DEDUP.match(first_line)
        if (last_m and first_m and
                last_m.group().strip() == first_m.group().strip() and
                len(first_line) > len(last_line)):
            # The next page has a more complete version; drop the truncated last line.
            result = '\n'.join(result_lines[:-1])
        result += '\n' + page
    return result
```

---

_Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 4 parser feedback entries._

