[Parser] Page boundary causes duplicate list items with truncated text #12

@ilhamfp

Parser Improvement: Page boundary causes duplicate list items with truncated text

Severity: HIGH
File: extract_pymupdf.py
Function: _dedup_page_breaks
Estimated errors fixed: 2

Current Behavior

When a page break falls mid-sentence inside a list item (e.g., 'a. keamanan' at the end of one page, then 'a. keamanan negara...' at the start of the next), overlap detection fails because the tail of the previous page is not an exact prefix of the next page: the previous page ends with truncated text plus trailing artifacts, while the next page restarts the item cleanly. The result is a duplicated list item with garbled text between the two copies.
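The failure can be reproduced in isolation. The sketch below applies the same suffix check that `_dedup_page_breaks` currently uses to two hypothetical page texts modeled on the 'a. keamanan' example. The page contents, the trailing "- 3 -" artifact, and the helper name `find_overlap` are assumptions made for illustration and are not taken from the parser feedback.

```python
# Hypothetical page texts; the trailing "- 3 -" artifact is an assumption.
prev_page = "Pasal 5\na. keamanan - 3 -"
next_page = "a. keamanan negara;\nb. ketertiban umum;"


def find_overlap(result: str, page: str, limit: int = 200, min_len: int = 10) -> int:
    """Return the length of the longest suffix of `result` (longer than
    `min_len` characters) that `page` starts with, or 0 if there is none."""
    max_check = min(limit, len(result), len(page))
    for length in range(max_check, min_len, -1):
        if page.startswith(result[-length:]):
            return length
    return 0


overlap = find_overlap(prev_page, next_page)

# Same joining rule as the current code: trim the overlap, or fall back to
# a plain newline join when no overlap was found.
if overlap > 0:
    joined = prev_page + next_page[overlap:]
else:
    joined = prev_page + "\n" + next_page

print(overlap)
print(joined)
```

Because every candidate suffix ends in the artifact, none of them is a prefix of the next page, so the pages are joined with a plain newline and both copies of item 'a.' survive.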

Proposed Fix

Enhance page deduplication to detect partial line overlaps at page boundaries. When no exact overlap is found, compare the last non-empty line of the previous page with the first non-empty line of the next page. If both start with the same list marker (a., 1., (1), etc.) and the next page's line is longer, treat the pair as a page-break duplicate: drop the truncated line from the previous page and keep the next page's version.
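The marker comparison can be sketched on its own, separate from the page-joining loop. The regular expression mirrors the one in the Code After section; the helper name `is_truncated_duplicate` is hypothetical and exists only to make the rule easy to test.

```python
import re

# Matches a leading list marker such as "a.", "1." or "(1)".
LIST_MARKER_RE = re.compile(r'^\s*(?:[a-z]\.|\d+\.|\(\d+\))\s*')


def is_truncated_duplicate(last_line: str, first_line: str) -> bool:
    """Return True when both lines start with the same list marker and
    `first_line` is longer, i.e. it looks like the complete version of a
    line that was cut off at the page break."""
    last_m = LIST_MARKER_RE.match(last_line)
    first_m = LIST_MARKER_RE.match(first_line)
    return bool(
        last_m and first_m
        and last_m.group().strip() == first_m.group().strip()
        and len(first_line) > len(last_line)
    )


print(is_truncated_duplicate("a. keamanan", "a. keamanan negara;"))
print(is_truncated_duplicate("a. keamanan", "b. ketertiban umum;"))
```

Note that the rule compares only the marker and the line lengths, not the text after the marker. A short, complete item at the end of one page followed by an unrelated list restarting at the same marker on the next page would also match, so the length check is a heuristic rather than a guarantee.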

Code Before

def _dedup_page_breaks(pages: list[str]) -> str:
    """Join pages while removing duplicated text at page boundaries."""
    if not pages:
        return ""
    result = pages[0]
    for page in pages[1:]:
        overlap = 0
        max_check = min(200, len(result), len(page))
        for length in range(max_check, 10, -1):
            suffix = result[-length:]
            if page.startswith(suffix):
                overlap = length
                break
        if overlap > 0:
            result += page[overlap:]
        else:
            result += '\n' + page
    return result

Code After

import re

# Matches a leading list marker such as "a.", "1." or "(1)".
_LIST_MARKER_RE_DEDUP = re.compile(r'^\s*(?:[a-z]\.|\d+\.|\(\d+\))\s*')

def _dedup_page_breaks(pages: list[str]) -> str:
    """Join pages while removing duplicated text at page boundaries."""
    if not pages:
        return ""
    result = pages[0]
    for page in pages[1:]:
        overlap = 0
        max_check = min(200, len(result), len(page))
        for length in range(max_check, 10, -1):
            suffix = result[-length:]
            if page.startswith(suffix):
                overlap = length
                break
        if overlap > 0:
            result += page[overlap:]
        else:
            # Check for a partial duplicate at the page boundary: if the last
            # non-blank line of result starts with the same list marker as the
            # first non-blank line of the next page, the page broke mid-item.
            result_lines = result.rstrip().split('\n')
            last_line = result_lines[-1].strip()
            # First non-blank line of the next page ('' if the page is blank).
            first_line = next(
                (pl.strip() for pl in page.split('\n') if pl.strip()), '')
            last_m = _LIST_MARKER_RE_DEDUP.match(last_line)
            first_m = _LIST_MARKER_RE_DEDUP.match(first_line)
            if (last_m and first_m and
                    last_m.group().strip() == first_m.group().strip() and
                    len(first_line) > len(last_line)):
                # The next page has a more complete version; drop the
                # truncated last line from result.
                result = '\n'.join(result_lines[:-1])
            result += '\n' + page
    return result

Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 4 parser feedback entries.
