Parser Improvement: Page boundary causes duplicate list items with truncated text
Severity: HIGH
File: extract_pymupdf.py
Function: _dedup_page_breaks
Estimated errors fixed: 2
Current Behavior
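When a page break occurs mid-sentence in a list item (e.g., 'a. keamanan' at the end of one page, then 'a. keamanan negara...' at the start of the next), the overlap detection fails because the text isn't an exact suffix match: the first page ends with truncated text and trailing artifacts, while the next page restarts the item cleanly. The existing check only recognizes an overlap when the last 11 to 200 characters of the accumulated text match the start of the next page exactly, so any stray character defeats it. This results in duplicate list items with garbled text between them.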
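The failure can be reproduced in isolation. The sketch below pulls the suffix-matching loop out into a hypothetical helper, `find_exact_overlap` (not a function in extract_pymupdf.py), and runs it on two made-up page endings. The trailing ' . .' stands in for whatever artifact the extractor actually leaves behind; it is an assumption for illustration only.

```python
def find_exact_overlap(prev: str, nxt: str, max_len: int = 200, min_len: int = 11) -> int:
    """Return the length of the longest suffix of `prev` that `nxt` starts with.

    Mirrors the loop in _dedup_page_breaks: only lengths from `max_len`
    down to `min_len` are tried, and 0 means no overlap was found.
    """
    max_check = min(max_len, len(prev), len(nxt))
    for length in range(max_check, min_len - 1, -1):
        if nxt.startswith(prev[-length:]):
            return length
    return 0


next_page = "a. keamanan negara;\nb. ketertiban umum;"

# Clean truncation: the page ends exactly where the next one restarts.
clean_prev = "Hak tersebut meliputi:\na. keamanan"

# Truncation followed by stray characters (hypothetical artifact).
garbled_prev = "Hak tersebut meliputi:\na. keamanan . ."

print(find_exact_overlap(clean_prev, next_page))    # 11
print(find_exact_overlap(garbled_prev, next_page))  # 0
```

With the clean ending, the 11-character suffix 'a. keamanan' is found and removed. With the artifact, every candidate suffix ends in ' . .', nothing matches, and both copies of the item survive.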
Proposed Fix
Enhance page deduplication to detect partial line overlaps at page boundaries. When the last non-empty line of the previous page starts with a list marker (a., 1., (1), etc.) and the first non-empty line of the next page starts with the same marker and is longer, treat it as a page-break duplicate and prefer the next page's version.
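The boundary test described above can be read as a single predicate over two lines. A minimal sketch follows, using the same marker pattern as the proposed change; the helper name `is_restarted_item` is hypothetical and exists only for this illustration.

```python
import re

# Matches a leading "a.", "12." or "(3)" style list marker.
LIST_MARKER_RE = re.compile(r'^\s*(?:[a-z]\.|\d+\.|\(\d+\))\s*')


def is_restarted_item(last_line: str, first_line: str) -> bool:
    """True if `first_line` looks like a fuller restart of the item on `last_line`."""
    last_line, first_line = last_line.strip(), first_line.strip()
    last_m = LIST_MARKER_RE.match(last_line)
    first_m = LIST_MARKER_RE.match(first_line)
    return bool(
        last_m and first_m
        # Both lines must carry the same marker ...
        and last_m.group().strip() == first_m.group().strip()
        # ... and the next page's line must be the longer, more complete one.
        and len(first_line) > len(last_line)
    )


print(is_restarted_item("a. keamanan . .", "a. keamanan negara;"))  # True
print(is_restarted_item("a. keamanan", "b. ketertiban umum;"))      # False
```

The length comparison is a cheap proxy for "more complete". It does not check that the two lines share any text beyond the marker, so two genuinely different items that happen to reuse a marker across a page boundary would also be merged.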
Code Before
def _dedup_page_breaks(pages: list[str]) -> str:
    """Join pages while removing duplicated text at page boundaries."""
    if not pages:
        return ""
    result = pages[0]
    for page in pages[1:]:
        overlap = 0
        max_check = min(200, len(result), len(page))
        for length in range(max_check, 10, -1):
            suffix = result[-length:]
            if page.startswith(suffix):
                overlap = length
                break
        if overlap > 0:
            result += page[overlap:]
        else:
            result += '\n' + page
    return result
Code After
import re

# Leading list marker: "a.", "12." or "(3)".
_LIST_MARKER_RE_DEDUP = re.compile(r'^\s*(?:[a-z]\.|\d+\.|\(\d+\))\s*')


def _dedup_page_breaks(pages: list[str]) -> str:
    """Join pages while removing duplicated text at page boundaries."""
    if not pages:
        return ""
    result = pages[0]
    for page in pages[1:]:
        # Exact overlap: longest suffix of result (11..200 chars) that
        # the next page starts with.
        overlap = 0
        max_check = min(200, len(result), len(page))
        for length in range(max_check, 10, -1):
            suffix = result[-length:]
            if page.startswith(suffix):
                overlap = length
                break
        if overlap > 0:
            result += page[overlap:]
        else:
            # Check for partial duplicate at page boundary:
            # If last non-blank line of result starts with same list marker
            # as first non-blank line of next page, the page broke mid-item.
            # Remove the truncated trailing line from result.
            result_lines = result.rstrip().split('\n')
            page_lines = page.lstrip().split('\n')
            last_line = result_lines[-1].strip() if result_lines else ''
            first_line = ''
            for pl in page_lines:
                if pl.strip():
                    first_line = pl.strip()
                    break
            last_m = _LIST_MARKER_RE_DEDUP.match(last_line)
            first_m = _LIST_MARKER_RE_DEDUP.match(first_line)
            if (last_m and first_m and
                    last_m.group().strip() == first_m.group().strip() and
                    len(first_line) > len(last_line)):
                # The next page has a more complete version; drop truncated tail
                result = '\n'.join(result_lines[:-1])
            # In both cases the next page is appended on a new line.
            result += '\n' + page
    return result
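To see the intended effect end to end, the following sketch condenses the logic above into a standalone function and runs it on two made-up page pairs. The name `dedup_page_breaks_sketch`, the sample text, and the '~' artifact are all assumptions for illustration; this is not the code in extract_pymupdf.py.

```python
import re

MARKER_RE = re.compile(r'^\s*(?:[a-z]\.|\d+\.|\(\d+\))\s*')


def dedup_page_breaks_sketch(pages: list[str]) -> str:
    """Condensed variant of the proposed page-joining logic."""
    if not pages:
        return ""
    result = pages[0]
    for page in pages[1:]:
        # Pass 1: exact suffix/prefix overlap of 11..200 characters.
        overlap = 0
        for length in range(min(200, len(result), len(page)), 10, -1):
            if page.startswith(result[-length:]):
                overlap = length
                break
        if overlap > 0:
            result += page[overlap:]
            continue
        # Pass 2: same list marker on both sides of the boundary.
        result_lines = result.rstrip().split('\n')
        last_line = result_lines[-1].strip()
        first_line = next((ln.strip() for ln in page.split('\n') if ln.strip()), '')
        last_m = MARKER_RE.match(last_line)
        first_m = MARKER_RE.match(first_line)
        if (last_m and first_m
                and last_m.group().strip() == first_m.group().strip()
                and len(first_line) > len(last_line)):
            # Drop the truncated last line; the next page restates it.
            result = '\n'.join(result_lines[:-1])
        result += '\n' + page
    return result


truncated = ["Hak tersebut meliputi:\na. keam~", "a. keamanan negara;\nb. ketertiban umum;"]
distinct = ["Daftar:\na. satu;", "b. dua;"]

print(dedup_page_breaks_sketch(truncated))
print(dedup_page_breaks_sketch(distinct))
```

In the first pair the truncated 'a. keam~' line is dropped and the item appears once. In the second pair the markers differ, so both lines are kept and the pages are simply joined with a newline.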
Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 4 parser feedback entries.