feat(module-postgres): early warning and mitigation for WAL slot invalidation during snapshot #554
Merged
Sleepful merged 12 commits into powersync-ja:main on Apr 8, 2026
🦋 Changeset detected
Latest commit: aed41f0
The changes in this PR will be included in the next version bump. This PR includes changesets to release 19 packages.
Force-pushed from 4cd30c1 to bc1a9d1
Sleepful
commented
Mar 11, 2026
rkistner
reviewed
Mar 16, 2026
modules/module-postgres/src/replication/WalStreamReplicationJob.ts
Outdated
Addressing review feedback
Two new commits addressing all review comments:
Error class refactor:
Cleanup:
Force-pushed from ced10a2 to 26fddf6
Sleepful
commented
Apr 3, 2026
fwiw flaky CI test, retried twice to make it pass: https://github.com/powersync-ja/powersync-service/actions/runs/23939262388/job/69821896741
Force-pushed from 1ed1aa5 to 89bc1cb
rkistner
approved these changes
Apr 8, 2026
…validation
Add onSnapshotChunkFlushed hook to WalStreamOptions for deterministic test-time WAL generation during snapshot chunk processing.
Two new tests:
- "slot lost during snapshot aborts early": verifies snapshot aborts when slot is invalidated mid-flight (currently fails: no slot health check)
- "slot invalidation error carries diagnostic context": verifies error carries walStatus/phase fields (currently fails: properties not yet added)
Both tests are expected to fail until the detection feature is implemented.
…unit tests
SlotInvalidationContext interface and shouldRetryReplication() stub (always returns true) added to WalStream.ts. Four pure unit tests cover the retry decision matrix: block retry when slot lost during snapshot, allow retry for streaming/missing/rows_removed. Test 1 fails as expected (stub returns true, test expects false). Implementation comes in a later commit.
Add checkSlotHealth() to WalStream that queries pg_replication_slots after each chunk flush. Throws MissingReplicationSlotError with phase and walStatus metadata so shouldRetryReplication() can distinguish snapshot-phase invalidation (non-retryable) from streaming (retryable).
…idation
Block replication retry when slot is invalidated (wal_status=lost) during snapshot phase, since retrying would repeat the same long snapshot and likely fail again. Allow retry for streaming phase, missing slots, and rows_removed invalidation reason.
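The retry rule in this commit message can be sketched as a small pure function. This is an illustration only: the SlotErrorInfo shape, the phase literals, the function name, and the precedence of rows_removed over the snapshot check are assumptions, not the PR's actual shouldRetryReplication() signature.

```typescript
// Hypothetical shapes for illustration; the PR's real types may differ.
type ReplicationPhase = 'snapshot' | 'streaming';

interface SlotErrorInfo {
  phase: ReplicationPhase;
  // wal_status from pg_replication_slots; undefined when the slot row is missing.
  walStatus?: string;
  // invalidation_reason, when the server reports one.
  invalidationReason?: string;
}

function shouldRetryAfterSlotError(info: SlotErrorInfo): boolean {
  // Assumed precedence: rows_removed always allows a retry.
  if (info.invalidationReason === 'rows_removed') {
    return true;
  }
  // A slot lost mid-snapshot would only repeat the same long snapshot.
  if (info.phase === 'snapshot' && info.walStatus === 'lost') {
    return false;
  }
  // Streaming-phase failures and missing slots remain retryable.
  return true;
}
```

Keeping the decision free of I/O is what lets the four unit tests mentioned earlier exercise the matrix without a database.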
Time-throttled logging of WAL budget remaining, consumption rate, and ETA during snapshot. Warns at 50% budget. Exported pure functions for budget computation and formatting. Fixed PG <13 compat in checkSlotHealth() query.
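As a rough sketch of the budget computation this commit describes (remaining budget, consumption rate, ETA, 50% warning), assuming two samples of the slot's safe_wal_size taken at different times. The names WalBudgetSample, WalBudgetSummary and summarizeWalBudget are hypothetical and are not the exported functions from the PR; reading "warns at 50% budget" as "half of the limit consumed" is also an assumption.

```typescript
// Hypothetical sample of the slot's remaining WAL budget at a point in time.
interface WalBudgetSample {
  safeWalSizeBytes: number; // safe_wal_size reported for the slot
  atMs: number; // timestamp of the sample in milliseconds
}

interface WalBudgetSummary {
  remainingBytes: number;
  usedFraction: number; // share of the configured limit already consumed
  bytesPerSecond: number; // observed consumption rate, 0 if none
  etaSeconds: number | null; // time until the budget runs out, null if unknown
  warn: boolean;
}

function summarizeWalBudget(
  limitBytes: number,
  first: WalBudgetSample,
  latest: WalBudgetSample
): WalBudgetSummary {
  const elapsedSeconds = (latest.atMs - first.atMs) / 1000;
  const consumedBytes = first.safeWalSizeBytes - latest.safeWalSizeBytes;
  const bytesPerSecond =
    elapsedSeconds > 0 && consumedBytes > 0 ? consumedBytes / elapsedSeconds : 0;
  const remainingBytes = Math.max(latest.safeWalSizeBytes, 0);
  const usedFraction = limitBytes > 0 ? 1 - remainingBytes / limitBytes : 0;
  return {
    remainingBytes,
    usedFraction,
    bytesPerSecond,
    etaSeconds: bytesPerSecond > 0 ? remainingBytes / bytesPerSecond : null,
    // Assumption: warn once half of the limit has been consumed.
    warn: usedFraction >= 0.5
  };
}
```

For example, with a 1000-byte limit, going from 1000 to 400 bytes remaining over 60 seconds gives a rate of 10 bytes per second and an ETA of 40 seconds.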
…r messages
Add error code, fix guidance, and docs link to slot invalidation errors in checkSlotHealth() and initSlot(). Include observed WAL budget context when available. Add test assertions for error code and docs link in both checkSlotHealth and initSlot error paths.
Add wal_status, safe_wal_size, and max_slot_wal_keep_size to the SyncRulesStatus connection object. New optional getSlotWalBudget() on RouteAPI, implemented by PostgresRouteAPIAdapter. Fix withMaxWalSize() test helper to use its size parameter instead of hardcoding 100MB.
…dation reason
idle_timeout suggests increasing idle_replication_slot_timeout; other reasons still suggest max_slot_wal_keep_size.
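The mapping in this commit message is small enough to show directly. A minimal sketch, assuming the invalidation reason arrives as a plain string; the function names and the message wording are hypothetical, and only the two setting names come from the commit.

```typescript
// Maps an invalidation reason to the PostgreSQL setting worth reviewing.
function suggestedSettingFor(invalidationReason: string | null | undefined): string {
  return invalidationReason === 'idle_timeout'
    ? 'idle_replication_slot_timeout'
    : 'max_slot_wal_keep_size';
}

// Hypothetical message format; the PR's actual wording is not shown here.
function slotInvalidationHint(invalidationReason: string | null | undefined): string {
  return `Consider increasing ${suggestedSettingFor(invalidationReason)}.`;
}
```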
…quired and extract to separate files
Make walStatus and phase required constructor parameters, add invalidationReason field. Remove SlotInvalidationContext interface and pass error directly to shouldRetryReplication(). Move error class to MissingReplicationSlotError.ts and utility functions to wal-budget-utils.ts.
…Ls, fix budget context
Throttle checkSlotHealth() call in snapshotTable() instead of only throttling the budget log inside it. Remove docs URLs from error messages (codebase pattern is error code only). Simplify formatWalBudgetContext() to show only the WAL limit, removing misleading elapsed-time field.
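One way to throttle a per-chunk check by time, shown as a sketch under assumptions: createThrottle is a hypothetical helper, not code from the PR, and the interval passed to it is up to the caller.

```typescript
// Returns a function that reports true at most once per interval.
// The clock is injectable so the behavior can be tested without waiting.
function createThrottle(intervalMs: number, now: () => number = Date.now): () => boolean {
  let lastRunMs = Number.NEGATIVE_INFINITY;
  return () => {
    const current = now();
    if (current - lastRunMs < intervalMs) {
      return false;
    }
    lastRunMs = current;
    return true;
  };
}
```

A snapshot loop would call the returned function after each chunk flush and only run the slot health query when it returns true, so the query cost no longer scales with the number of chunks.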
… ReplicationPhase union types
Extract ReplicationPhase type alias with JSDoc explaining retry semantics for each phase. Narrows the phase field from bare string literals to the named type for better discoverability and type safety.
Force-pushed from 89bc1cb to aed41f0
Sleepful
commented
Apr 8, 2026
Summary
PSYNC_S1146 and docs link
Problem
When PowerSync runs a full snapshot (triggered by sync rule changes), the replication slot can be silently invalidated if WAL growth exceeds max_slot_wal_keep_size. The operator only discovers this after the snapshot completes and streaming fails, often hours or days later. The fix is simple (increase the limit) but discovery is painful.
Changes
Mid-snapshot slot health check
After each chunk flush in snapshotTable(), query pg_replication_slots to check if the slot is still valid. If wal_status = 'lost', abort immediately with an enriched MissingReplicationSlotError carrying walStatus and phase fields.
The check hits shared memory (~1-2ms per round-trip), negligible next to per-chunk storage flush.
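The check above can be sketched as a query plus a pure assertion on its result. This is a simplified illustration, not the PR's checkSlotHealth(): SlotLostError stands in for the enriched MissingReplicationSlotError, and SLOT_HEALTH_SQL, SlotHealthRow and assertSlotNotLost are hypothetical names. Running the query is left to whatever Postgres client the caller uses.

```typescript
// wal_status is available on pg_replication_slots in PostgreSQL 13+.
const SLOT_HEALTH_SQL =
  'SELECT wal_status FROM pg_replication_slots WHERE slot_name = $1';

// Simplified stand-in for the PR's enriched MissingReplicationSlotError.
class SlotLostError extends Error {
  walStatus: string;
  phase: 'snapshot' | 'streaming';

  constructor(slotName: string, walStatus: string, phase: 'snapshot' | 'streaming') {
    super(`Replication slot ${slotName} is invalidated (wal_status=${walStatus}) during ${phase}`);
    this.name = 'SlotLostError';
    this.walStatus = walStatus;
    this.phase = phase;
  }
}

// Hypothetical row shape for the query above; undefined means no row came back.
interface SlotHealthRow {
  wal_status: string | null;
}

// Throws as soon as the slot is reported lost so the caller can stop the snapshot.
// Missing rows are not handled here; this sketch only covers wal_status = 'lost'.
function assertSlotNotLost(
  slotName: string,
  row: SlotHealthRow | undefined,
  phase: 'snapshot' | 'streaming'
): void {
  if (row !== undefined && row.wal_status === 'lost') {
    throw new SlotLostError(slotName, 'lost', phase);
  }
}
```

Carrying walStatus and phase on the error is what allows a later retry decision to tell a mid-snapshot invalidation apart from a streaming-phase one.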
Conditional retry
New shouldRetryReplication() function controls whether WalStreamReplicationJob retries after slot invalidation:
invalidation_reason = rows_removed
WAL budget reporting
Time-throttled logging (every 2 min) of WAL budget during snapshot:
Computation extracted into exported pure functions (computeWalBudgetReport, formatWalBudgetLine, formatBytes, formatDuration) for testability.
Actionable error messages
Slot invalidation errors now include:
- PSYNC_S1146
- docs.powersync.com/self-hosting/troubleshooting/replication-slot-invalidated
Applied to both checkSlotHealth() (mid-snapshot) and initSlot() (between replication cycles).
Diagnostics API
Three new optional fields on the SyncRulesStatus connection object:
- wal_status: slot status from pg_replication_slots (PG 13+)
- safe_wal_size: bytes remaining before potential invalidation
- max_slot_wal_keep_size: configured limit in bytes
New optional getSlotWalBudget() on RouteAPI, implemented by PostgresRouteAPIAdapter. Non-Postgres adapters are unchanged.
Bug fix
withMaxWalSize() test helper was ignoring its size parameter and hardcoding '100MB'. Now uses the parameter.
Files changed
- modules/module-postgres/src/replication/WalStream.ts: MissingReplicationSlotError enrichment, checkSlotHealth(), WAL budget tracking, shouldRetryReplication(), formatWalBudgetContext(), pure functions
- modules/module-postgres/src/replication/WalStreamReplicationJob.ts
- modules/module-postgres/src/api/PostgresRouteAPIAdapter.ts: getSlotWalBudget() implementation
- packages/service-core/src/api/RouteAPI.ts: SlotWalBudgetInfo interface, optional method
- packages/service-core/src/api/diagnostics.ts
- packages/service-errors/src/codes.ts: PSYNC_S1146 error code
- packages/types/src/definitions.ts: wal_status, safe_wal_size, max_slot_wal_keep_size fields
- modules/module-postgres/test/src/wal_stream.test.ts
- modules/module-postgres/test/src/replication_retry.test.ts
- modules/module-postgres/test/src/wal_budget.test.ts
- modules/module-postgres/test/src/wal_budget_api.test.ts
- modules/module-postgres/test/src/wal_stream_utils.ts: withMaxWalSize() bug fix
Testing
- onSnapshotChunkFlushed hook for deterministic WAL generation (no race conditions)
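To illustrate why a post-flush hook removes the race, here is a toy snapshot loop, written under assumptions: only the hook name onSnapshotChunkFlushed comes from the PR, while SnapshotSketchOptions, runSnapshotSketch and the synchronous hook signature are hypothetical simplifications.

```typescript
// Hypothetical options shape; only the hook name comes from the PR.
interface SnapshotSketchOptions {
  // Called after each chunk is flushed. Synchronous here for brevity.
  onSnapshotChunkFlushed?: (chunkIndex: number) => void;
}

// Flushes chunks in order and checks slot health after each one.
function runSnapshotSketch(
  chunks: string[][],
  isSlotLost: () => boolean,
  options: SnapshotSketchOptions = {}
): number {
  let flushedRows = 0;
  chunks.forEach((chunk, index) => {
    flushedRows += chunk.length; // stand-in for writing the chunk to storage
    options.onSnapshotChunkFlushed?.(index);
    if (isSlotLost()) {
      throw new Error(`slot lost after chunk ${index}; flushed ${flushedRows} rows`);
    }
  });
  return flushedRows;
}
```

A test can use the hook to trigger the invalidation at a known chunk boundary, so the abort point is the same on every run instead of depending on timing.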