feat(module-postgres): early warning and mitigation for WAL slot invalidation during snapshot #554

Merged
Sleepful merged 12 commits into powersync-ja:main from Sleepful:wal-slot-invalidation on Apr 8, 2026

Sleepful commented Mar 7, 2026

Summary

  • Detect replication slot invalidation mid-snapshot and abort early instead of completing a doomed hours-long snapshot
  • Log WAL budget consumption with rate and ETA during snapshot — warn at 50% remaining
  • Block futile replication retries when slot is lost during snapshot (retrying would repeat the same long snapshot)
  • Surface actionable error messages with error code PSYNC_S1146 and docs link
  • Expose WAL budget fields in the diagnostics API for dashboard/operator visibility

Problem

When PowerSync runs a full snapshot (triggered by sync rule changes), the replication slot can be silently invalidated if WAL growth exceeds max_slot_wal_keep_size. The operator only discovers this after the snapshot completes and streaming fails — often hours or days later. The fix is simple (increase the limit) but discovery is painful.

Changes

Mid-snapshot slot health check

After each chunk flush in snapshotTable(), query pg_replication_slots to check if the slot is still valid. If wal_status = 'lost', abort immediately with an enriched MissingReplicationSlotError carrying walStatus and phase fields.

The check reads from shared memory (roughly 1-2 ms per round trip), which is negligible next to the per-chunk storage flush.
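
A minimal sketch of the check described above, under stated assumptions: `SlotRow`, `SlotLostError`, and `assertSlotUsable` are hypothetical stand-ins, not the PR's actual `checkSlotHealth()` or `MissingReplicationSlotError`. The caller is assumed to run the query after each chunk flush and pass the first returned row (or `undefined`) to the pure decision function.

```typescript
// Hypothetical stand-ins for illustration; not the PR's actual classes.
type Phase = 'snapshot' | 'streaming';

interface SlotRow {
  wal_status: string | null; // column exists on PostgreSQL 13 and later
}

// Query the caller would run after each chunk flush (slot name bound as $1).
const SLOT_STATUS_SQL =
  'SELECT wal_status FROM pg_replication_slots WHERE slot_name = $1';

class SlotLostError extends Error {
  walStatus: string;
  phase: Phase;

  constructor(message: string, walStatus: string, phase: Phase) {
    super(message);
    this.name = 'SlotLostError';
    this.walStatus = walStatus;
    this.phase = phase;
  }
}

// Pure decision step: throw only when the slot is reported as lost.
function assertSlotUsable(slotName: string, row: SlotRow | undefined, phase: Phase): void {
  if (row !== undefined && row.wal_status === 'lost') {
    throw new SlotLostError(`Replication slot ${slotName} was invalidated`, 'lost', phase);
  }
}
```

Keeping the decision separate from the query is a design choice of this sketch: it lets the abort logic be exercised without a database.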

Conditional retry

A new shouldRetryReplication() function controls whether WalStreamReplicationJob retries after slot invalidation:

| Condition | Action |
| --- | --- |
| Slot lost during snapshot | Block retry: it would repeat the same long snapshot |
| Slot lost during streaming | Allow retry: streaming invalidation is often transient |
| Slot missing | Allow retry: the slot may have been dropped externally |
| invalidation_reason = rows_removed | Allow retry: not a WAL budget issue |
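
The decision matrix above can be sketched as a pure function. The type names follow the PR discussion, but the `SlotErrorInfo` shape and the precedence given to `rows_removed` are assumptions made for illustration, not the merged implementation.

```typescript
// Hypothetical shapes; the real error class in the PR may differ.
type WalStatus = 'reserved' | 'extended' | 'unreserved' | 'lost' | 'missing';
type ReplicationPhase = 'snapshot' | 'streaming';

interface SlotErrorInfo {
  walStatus: WalStatus;
  phase: ReplicationPhase;
  invalidationReason?: string;
}

function shouldRetryReplicationSketch(err: SlotErrorInfo): boolean {
  // rows_removed is not a WAL budget problem, so a retry is allowed.
  // (Checking this first is an assumption of this sketch.)
  if (err.invalidationReason === 'rows_removed') {
    return true;
  }
  // Only a slot lost mid-snapshot blocks the retry; every other case retries.
  return !(err.walStatus === 'lost' && err.phase === 'snapshot');
}
```

For example, `shouldRetryReplicationSketch({ walStatus: 'lost', phase: 'snapshot' })` returns `false`, while the same status in the `'streaming'` phase returns `true`.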

WAL budget reporting

Time-throttled logging (every 2 min) of WAL budget during snapshot:

  • Budget remaining (bytes and %)
  • Consumption rate (GB/hr) computed from successive samples
  • ETA to exhaustion
  • Warning at 50% remaining

The computation is extracted into exported pure functions (computeWalBudgetReport, formatWalBudgetLine, formatBytes, formatDuration) for testability.
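
A rough sketch of how a rate and an ETA can be derived from two successive samples. The `WalSample` and `BudgetReport` shapes, the function name, and the inclusive 50% threshold are assumptions for illustration; the PR's `computeWalBudgetReport` may take different inputs and units.

```typescript
// Hypothetical shapes; not the signatures used in the PR.
interface WalSample {
  atMs: number;             // wall-clock time of the sample
  safeWalSizeBytes: number; // bytes remaining before potential invalidation
}

interface BudgetReport {
  remainingBytes: number;
  remainingPercent: number;
  rateBytesPerHour: number | null; // null until two samples exist
  etaMs: number | null;            // null when the budget is not shrinking
  warn: boolean;
}

const MS_PER_HOUR = 3_600_000;

function computeBudgetSketch(limitBytes: number, prev: WalSample | null, curr: WalSample): BudgetReport {
  const remainingPercent = (curr.safeWalSizeBytes * 100) / limitBytes;
  let rateBytesPerHour: number | null = null;
  let etaMs: number | null = null;

  if (prev !== null && curr.atMs > prev.atMs) {
    const consumedBytes = prev.safeWalSizeBytes - curr.safeWalSizeBytes;
    const elapsedHours = (curr.atMs - prev.atMs) / MS_PER_HOUR;
    rateBytesPerHour = consumedBytes / elapsedHours;
    if (rateBytesPerHour > 0) {
      // Time left at the observed rate, converted back to milliseconds.
      etaMs = (curr.safeWalSizeBytes / rateBytesPerHour) * MS_PER_HOUR;
    }
  }

  return {
    remainingBytes: curr.safeWalSizeBytes,
    remainingPercent,
    rateBytesPerHour,
    etaMs,
    warn: remainingPercent <= 50 // warn at 50% remaining (inclusive here by assumption)
  };
}
```

With a limit of 1000 bytes and samples of 800 and then 600 bytes remaining taken one hour apart, the sketch reports 60% remaining, a rate of 200 bytes/hr, and an ETA of three hours.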

Actionable error messages

Slot invalidation errors now include:

  • Error code PSYNC_S1146
  • Fix guidance ("Increase max_slot_wal_keep_size on the source database")
  • Docs link (docs.powersync.com/self-hosting/troubleshooting/replication-slot-invalidated)
  • Observed WAL budget context when available (limit, time to exhaustion)

Applied to both checkSlotHealth() (mid-snapshot) and initSlot() (between replication cycles).
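
As an illustration of how such a message could be assembled, here is a hedged sketch. The function names and exact wording are hypothetical; the `idle_timeout` branch reflects the reason-specific guidance described in a later commit of this PR, and the docs link is omitted because a later comment notes it was removed from error messages.

```typescript
const SLOT_INVALIDATED_CODE = 'PSYNC_S1146';

// Pick fix guidance based on the invalidation reason, when one is known.
function slotInvalidationGuidance(invalidationReason?: string): string {
  if (invalidationReason === 'idle_timeout') {
    return 'Increase idle_replication_slot_timeout on the source database.';
  }
  return 'Increase max_slot_wal_keep_size on the source database.';
}

// Hypothetical formatter; message wording is illustrative only.
function formatSlotInvalidationMessage(
  slotName: string,
  invalidationReason?: string,
  walLimitBytes?: number
): string {
  const parts = [
    `[${SLOT_INVALIDATED_CODE}] Replication slot ${slotName} was invalidated.`,
    slotInvalidationGuidance(invalidationReason)
  ];
  if (walLimitBytes !== undefined) {
    // Observed WAL budget context, included only when available.
    parts.push(`Observed WAL limit: ${walLimitBytes} bytes.`);
  }
  return parts.join(' ');
}
```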

Diagnostics API

Three new optional fields on the SyncRulesStatus connection object:

  • wal_status — slot status from pg_replication_slots (PG 13+)
  • safe_wal_size — bytes remaining before potential invalidation
  • max_slot_wal_keep_size — configured limit in bytes

New optional getSlotWalBudget() on RouteAPI, implemented by PostgresRouteAPIAdapter. Non-Postgres adapters are unchanged.
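
A sketch of the shape this implies, with assumptions labeled: the field names come from the description above, but the value types, the `*Sketch` interface names, and the `withWalBudget` helper are hypothetical and not taken from the codebase.

```typescript
// Field names come from the PR description; the value types are assumptions.
interface SlotWalBudgetInfoSketch {
  wal_status?: string;
  safe_wal_size?: number;
  max_slot_wal_keep_size?: number;
}

// Optional method: adapters for other databases simply do not define it.
interface RouteAPISketch {
  getSlotWalBudget?(): Promise<SlotWalBudgetInfoSketch | undefined>;
}

interface ConnectionStatusSketch extends SlotWalBudgetInfoSketch {
  id: string;
}

// Merge budget fields into a connection status only when they were reported.
function withWalBudget(
  conn: ConnectionStatusSketch,
  budget?: SlotWalBudgetInfoSketch
): ConnectionStatusSketch {
  return budget ? { ...conn, ...budget } : conn;
}
```

Because all three fields are optional, existing consumers of the diagnostics response keep working when an adapter reports nothing.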

Bug fix

The withMaxWalSize() test helper ignored its size parameter and hardcoded '100MB'. It now uses the parameter.

Files changed

| File | Change |
| --- | --- |
| modules/module-postgres/src/replication/WalStream.ts | MissingReplicationSlotError enrichment, checkSlotHealth(), WAL budget tracking, shouldRetryReplication(), formatWalBudgetContext(), pure functions |
| modules/module-postgres/src/replication/WalStreamReplicationJob.ts | Conditional retry wiring |
| modules/module-postgres/src/api/PostgresRouteAPIAdapter.ts | getSlotWalBudget() implementation |
| packages/service-core/src/api/RouteAPI.ts | SlotWalBudgetInfo interface, optional method |
| packages/service-core/src/api/diagnostics.ts | WAL budget fields in diagnostics response |
| packages/service-errors/src/codes.ts | PSYNC_S1146 error code |
| packages/types/src/definitions.ts | wal_status, safe_wal_size, max_slot_wal_keep_size fields |
| modules/module-postgres/test/src/wal_stream.test.ts | 2 integration tests (slot loss detection + diagnostic context) |
| modules/module-postgres/test/src/replication_retry.test.ts | 4 pure function tests for retry logic |
| modules/module-postgres/test/src/wal_budget.test.ts | 17 pure function tests for budget computation/formatting |
| modules/module-postgres/test/src/wal_budget_api.test.ts | 3 integration tests for diagnostics adapter |
| modules/module-postgres/test/src/wal_stream_utils.ts | withMaxWalSize() bug fix |

Testing

  • 26 new tests total (4 retry + 17 budget + 3 diagnostics API + 2 integration)
  • All existing tests pass
  • Pure function tests run without Postgres in milliseconds
  • Integration tests use onSnapshotChunkFlushed hook for deterministic WAL generation (no race conditions)

changeset-bot bot commented Mar 7, 2026

🦋 Changeset detected

Latest commit: aed41f0

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 19 packages

| Name | Type |
| --- | --- |
| @powersync/service-types | Patch |
| @powersync/service-core | Patch |
| @powersync/service-module-postgres | Patch |
| @powersync/service-errors | Patch |
| @powersync/service-schema | Patch |
| @powersync/service-client | Patch |
| @powersync/lib-service-postgres | Patch |
| @powersync/service-module-core | Patch |
| @powersync/service-module-mongodb-storage | Patch |
| @powersync/service-module-mongodb | Patch |
| @powersync/service-module-mssql | Patch |
| @powersync/service-module-mysql | Patch |
| @powersync/service-module-postgres-storage | Patch |
| @powersync/service-core-tests | Patch |
| @powersync/service-image | Patch |
| test-client | Patch |
| @powersync/lib-services-framework | Patch |
| @powersync/service-rsocket-router | Patch |
| @powersync/lib-service-mongodb | Patch |

Sleepful force-pushed the wal-slot-invalidation branch 2 times, most recently from 4cd30c1 to bc1a9d1 (March 11, 2026 07:51)
Sleepful changed the title from "test(module-postgres): add failing tests for mid-snapshot WAL slot invalidation" to "feat(module-postgres): early warning and mitigation for WAL slot invalidation during snapshot" (Mar 11, 2026)
Sleepful marked this pull request as ready for review (March 11, 2026 10:08)


Sleepful commented Apr 3, 2026

Addressing review feedback

Two new commits addressing all review comments:

Error class refactor:

  • walStatus and phase are now required on MissingReplicationSlotError — no more optional fields with guessed defaults
  • walStatus is now a WalStatus union type ('reserved' | 'extended' | 'unreserved' | 'lost' | 'missing') instead of string
  • Added invalidationReason field (passed through from PG 14+ pg_replication_slots)
  • Removed SlotInvalidationContext interface — shouldRetryReplication() now takes the error object directly, eliminating the (e as any) cast and default values in WalStreamReplicationJob
  • Extracted MissingReplicationSlotError.ts and wal-budget-utils.ts to reduce WalStream.ts size — barrel exports from new files in replication-index.ts

Cleanup:

  • Removed docs URLs from all error messages — kept [PSYNC_S1146] error code only (matching codebase pattern)
  • Throttled checkSlotHealth() itself (renamed walBudgetLogIntervalMs → slotHealthCheckIntervalMs, 2-min default). Integration tests override to 0 for per-chunk checking.
  • Fixed the misleading "exhausted in" label in formatWalBudgetContext() — was showing the log interval, not an ETA. Simplified to just show the WAL limit.

Note on initSlot() throw sites: these use phase: 'streaming' (not 'snapshot') because initSlot() runs at replication startup, not mid-snapshot. A lost slot found at startup is a pre-existing condition where retry is appropriate. Only checkSlotHealth() sets phase: 'snapshot' to block retry during active snapshots.
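
The throttling described in the cleanup list can be sketched as a small time gate. `IntervalGate` and its injectable clock are hypothetical names for illustration; the PR's `slotHealthCheckIntervalMs` option may be wired differently inside `snapshotTable()`.

```typescript
// Hypothetical helper: allows an action at most once per interval.
class IntervalGate {
  private intervalMs: number;
  private now: () => number;
  private lastRunMs = -Infinity; // the first call always passes

  constructor(intervalMs: number, now: () => number = Date.now) {
    this.intervalMs = intervalMs;
    this.now = now; // injectable clock keeps the gate testable
  }

  shouldRun(): boolean {
    const t = this.now();
    if (t - this.lastRunMs >= this.intervalMs) {
      this.lastRunMs = t;
      return true;
    }
    return false;
  }
}

// Assumed default matching the 2-minute interval mentioned above.
const DEFAULT_SLOT_HEALTH_CHECK_INTERVAL_MS = 2 * 60 * 1000;
```

With an interval of 0 the gate passes on every call, which corresponds to the per-chunk checking the integration tests rely on.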

Sleepful force-pushed the wal-slot-invalidation branch 4 times, most recently from ced10a2 to 26fddf6 (April 3, 2026 08:07)


Sleepful commented Apr 3, 2026

fwiw flaky CI test, retried twice to make it pass:

https://github.com/powersync-ja/powersync-service/actions/runs/23939262388/job/69821896741

FAIL  test/src/BinLogListener.test.ts > BinlogListener tests > Multi database events
Error: Timeout while waiting for [1] schema changes.
 ❯ waitForSchemaChanges test/src/BinLogListener.test.ts:478:13
    476|       await vi.waitFor(() => expect(eventHandler.schemaChanges.length)…
    477|     } catch (error) {
    478|       throw new Error(`Timeout while waiting for [${count}] schema cha…
       |             ^
    479|     }
    480|   }

Sleepful requested a review from rkistner (April 3, 2026 08:31)
Sleepful force-pushed the wal-slot-invalidation branch from 1ed1aa5 to 89bc1cb (April 8, 2026 05:47)

Sleepful added 10 commits April 8, 2026 16:37

…validation

Add onSnapshotChunkFlushed hook to WalStreamOptions for deterministic
test-time WAL generation during snapshot chunk processing. Two new tests:

- "slot lost during snapshot aborts early": verifies snapshot aborts when
  slot is invalidated mid-flight (currently fails: no slot health check)
- "slot invalidation error carries diagnostic context": verifies error
  carries walStatus/phase fields (currently fails: properties not yet added)

Both tests are expected to fail until the detection feature is implemented.

…unit tests

SlotInvalidationContext interface and shouldRetryReplication() stub (always
returns true) added to WalStream.ts. Four pure unit tests cover the retry
decision matrix: block retry when slot lost during snapshot, allow retry
for streaming/missing/rows_removed. Test 1 fails as expected (stub returns
true, test expects false). Implementation comes in a later commit.

Add checkSlotHealth() to WalStream that queries pg_replication_slots
after each chunk flush. Throws MissingReplicationSlotError with phase
and walStatus metadata so shouldRetryReplication() can distinguish
snapshot-phase invalidation (non-retryable) from streaming (retryable).

…idation

Block replication retry when slot is invalidated (wal_status=lost) during
snapshot phase, since retrying would repeat the same long snapshot and
likely fail again. Allow retry for streaming phase, missing slots, and
rows_removed invalidation reason.

Time-throttled logging of WAL budget remaining, consumption rate, and ETA
during snapshot. Warns at 50% budget. Exported pure functions for budget
computation and formatting. Fixed PG <13 compat in checkSlotHealth() query.

…r messages

Add error code, fix guidance, and docs link to slot invalidation errors
in checkSlotHealth() and initSlot(). Include observed WAL budget context
when available. Add test assertions for error code and docs link in both
checkSlotHealth and initSlot error paths.

Add wal_status, safe_wal_size, and max_slot_wal_keep_size to the
SyncRulesStatus connection object. New optional getSlotWalBudget() on
RouteAPI, implemented by PostgresRouteAPIAdapter. Fix withMaxWalSize()
test helper to use its size parameter instead of hardcoding 100MB.

…dation reason

idle_timeout suggests increasing idle_replication_slot_timeout;
other reasons still suggest max_slot_wal_keep_size.

…quired and extract to separate files

Make walStatus and phase required constructor parameters, add
invalidationReason field. Remove SlotInvalidationContext interface and
pass error directly to shouldRetryReplication(). Move error class to
MissingReplicationSlotError.ts and utility functions to wal-budget-utils.ts.

Sleepful added 2 commits April 8, 2026 16:37

…Ls, fix budget context

Throttle checkSlotHealth() call in snapshotTable() instead of only
throttling the budget log inside it. Remove docs URLs from error messages
(codebase pattern is error code only). Simplify formatWalBudgetContext()
to show only the WAL limit, removing misleading elapsed-time field.

… ReplicationPhase union types

Extract ReplicationPhase type alias with JSDoc explaining retry
semantics for each phase. Narrows the phase field from bare string
literals to the named type for better discoverability and type safety.

Sleepful force-pushed the wal-slot-invalidation branch from 89bc1cb to aed41f0 (April 8, 2026 22:40)
Sleepful merged commit 41875f7 into powersync-ja:main on Apr 8, 2026
32 checks passed