Skip to content

fix(shim): hold registry lock across load-modify-save to prevent pane ID races#640

Open
Wirasm wants to merge 2 commits intomainfrom
fix/shim-registry-lock
Open

fix(shim): hold registry lock across load-modify-save to prevent pane ID races#640
Wirasm wants to merge 2 commits intomainfrom
fix/shim-registry-lock

Conversation

@Wirasm
Copy link
Owner

@Wirasm Wirasm commented Mar 19, 2026

Summary

  • Add LockedRegistry guard type in state.rs holding both Flock<File> and PaneRegistry
  • load_and_lock() acquires exclusive flock + reads registry atomically, returns guard
  • save(self) writes while lock is still held, then consumes guard (lock drops)
  • Update all 8 callers in commands.rs: handle_split_window, handle_kill_pane, handle_select_pane, handle_set_option, handle_new_session, handle_new_window, handle_break_pane, handle_join_pane
  • Add thread-level concurrency test verifying unique pane ID allocation under parallel access

Test plan

  • cargo fmt --check passes
  • cargo clippy --all -- -D warnings passes
  • cargo test -p kild-tmux-shim passes (including new concurrency test)
  • cargo build --all passes

… ID races

state::load() and state::save() each acquired independent flocks. Between
the two calls, a concurrent split-window could read the same next_pane_id,
causing duplicate pane IDs and orphaned daemon PTYs.

Add LockedRegistry guard type that holds both the Flock and PaneRegistry.
save(self) writes while the lock is still held. Update all 8 callers in
commands.rs to use load_and_lock() + locked.save(). Add thread-level
concurrency test verifying unique pane ID allocation.
@Wirasm
Copy link
Owner Author

Wirasm commented Mar 19, 2026

PR Review Summary

Critical Issues (1 found)

Agent Issue Location
code-reviewer Build fails: cargo clippy -- -D warnings rejects dead load() function. After migrating all 8 callers in commands.rs from load()/save() to load_and_lock()/locked.save(), the standalone load() function is no longer called from production code (only tests and init_registry()). Clippy's -D dead-code flags this. Fix: add #[cfg(test)] to load() if test-only, or #[allow(dead_code)] like init_registry() already has. Same may apply to standalone save() -- it's only called from init_registry() which is itself #[allow(dead_code)]. crates/kild-tmux-shim/src/state.rs:181

Important Issues (1 found)

Agent Issue Location
type-design-analyzer LockedRegistry.save() writes to a new file handle, not the locked one. save(self) calls fs::File::create(&data_path) which opens a new file descriptor for writing, while the flock is held on a separate lock file. This is fine for correctness since flock is on registry.lock not panes.json. However, the write is not atomic -- a crash between File::create (which truncates) and write_all would leave a truncated/empty panes.json. Consider write to a temp file + rename for crash safety. This is pre-existing behavior from the old save() and not a regression, but worth noting since this PR elevates the locking story. crates/kild-tmux-shim/src/state.rs:238-252

Suggestions (3 found)

Agent Suggestion Location
code-simplifier handle_new_session mixes registry() and registry_mut() calls. Lines that read locked.registry().sessions.len() followed by locked.registry_mut().windows.insert(...) could be simplified by doing all reads first, then the mutable phase. Current code is correct but the interleaving makes it harder to follow. commands.rs:665-699
code-simplifier handle_split_window pre-captures format values into a tuple before save(). This is necessary because save(self) consumes the guard, but the tuple unpacking adds cognitive overhead. The pattern is clean and well-commented -- just noting it as a maintainability consideration. No action needed. commands.rs:316-336
docs-impact-agent Module doc comment labels load() + save() as "Legacy writes". This is accurate but could be confusing if someone later adds #[cfg(test)] to gate them. If they become test-only, the doc should reflect that. state.rs:11-12

Strengths

  • Clean RAII guard pattern -- LockedRegistry consumes-on-save is idiomatic Rust and prevents accidental lock-then-forget
  • The concurrency test (test_load_and_lock_concurrent_allocations_unique) directly validates the race condition this PR fixes with 8 parallel threads
  • All 8 callers in commands.rs were migrated consistently with no behavioral changes beyond the lock discipline improvement
  • Doc comment at module top updated to clearly describe the three locking tiers

Documentation Issues

  • CLAUDE.md mentions flock-based concurrency control at a high level -- no update needed for this internal API change
  • No user-facing docs affected

Test Coverage

  • New test_load_and_lock_basic_roundtrip covers the happy path
  • New test_load_and_lock_concurrent_allocations_unique covers the core race condition with 8 threads
  • Existing tests for load()/save() still exercise the legacy path
  • Gap: No test for LockedRegistry dropped without calling save() (verifying lock release on drop). Low risk since this delegates to Flock's Drop impl.

Error Handling

  • All error paths in LockedRegistry::save() properly propagate via ? -- no silent failures
  • Error messages are consistent with existing patterns ("failed to serialize", "failed to write")
  • No new unwrap()/expect() in production code

Verdict

NEEDS FIX -- The clippy dead-code error on load() must be resolved before merge. This is a one-line fix (#[allow(dead_code)] or #[cfg(test)]). After that fix, the PR is clean and ready.

Recommended Actions

  1. Fix the load() dead-code clippy error (critical -- build gate fails)
  2. Consider same treatment for standalone save() (may also trigger once init_registry is gated)
  3. Run cargo clippy --all -- -D warnings and cargo test -p kild-tmux-shim to verify

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant