fix(win10): skip POSIX heartbeat watchdog on Windows + sync logging foundation #107
Merged
THE BUG
=======
projects/keepkey-vault/src/bun/index.ts:4246 unconditionally spawned
`bash -c '<heartbeat watchdog script>'` at module load. On Win10, when
the app is launched from Explorer / Start Menu / installer Run-now,
the bun worker inherits an empty PATH from the parent process.
Bun.spawn delegates to libuv, which fails with UV_ENOENT (-4058) when
it can't locate `bash`, and the failure is delivered as an UNCAUGHT
asynchronous exception in the worker thread several seconds after the
spawn call site. The exception kills the entire app right around the
device pair flow, and the user sees a splash that hangs and then
disappears.
The watchdog is POSIX-only — its script uses bash, kill -9, date +%s,
sleep, cat — and could never have functioned on Windows even if `bash`
were on PATH. It exists to defend against an FFI freeze in
kkemu confirm_helper that only happens on macOS/Linux builds.
This was the dominant Win10 1.2.14 first-launch crash. Reproduced
twice across two different rebuilt 1.2.14 binaries (4f8ec1ba… and
2111ad61…), and verified fixed in-place by patching the installed
bundle and re-launching from the desktop icon.
THE FIX
=======
1. startHeartbeatWatchdog() returns early on process.platform === 'win32'
   with a [Vault] Heartbeat watchdog skipped on Windows log line.
2. The Bun.spawn(['bash', ...]) call is wrapped in try/catch as
   defense-in-depth on POSIX hosts where bash could conceivably be
   missing (containers, minimal NixOS, etc).
3. A long comment block above the function explains the platform
   constraint and references this incident, so the next person to
   touch the watchdog doesn't accidentally re-enable it on Windows.
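
A minimal sketch of the guard described in steps 1 and 2, not the code from this PR. The spawn function and logger are injected (hypothetical `SpawnFn` and `LogFn` types) so the logic can run without Bun; in the real module the spawn would be `Bun.spawn(['bash', ...])`. The "failed to start" log text and the boolean return value are assumptions for illustration.

```typescript
// Hypothetical types: stand-ins for Bun.spawn and the vault logger.
type SpawnFn = (cmd: string[]) => unknown;
type LogFn = (line: string) => void;

// Placeholder only; the real watchdog script is not reproduced here.
const WATCHDOG_SCRIPT = "<heartbeat watchdog script>";

// Returns true if the watchdog was started, false if it was skipped or failed.
function startHeartbeatWatchdog(platform: string, spawn: SpawnFn, log: LogFn): boolean {
  // The script relies on bash, kill -9, date +%s, sleep and cat, so it is
  // POSIX-only. On Windows we never reach the spawn call at all.
  if (platform === "win32") {
    log("[Vault] Heartbeat watchdog skipped on Windows (POSIX-only)");
    return false;
  }
  try {
    spawn(["bash", "-c", WATCHDOG_SCRIPT]);
    return true;
  } catch (err) {
    // Defense in depth for POSIX hosts without bash. This only sees errors
    // thrown synchronously at the call site.
    const reason = err instanceof Error ? err.message : String(err);
    log(`[Vault] Heartbeat watchdog failed to start: ${reason}`); // assumed wording
    return false;
  }
}
```

Because a try/catch only sees errors thrown synchronously, and the Win10 failure arrived asynchronously, the early return is what keeps Windows off this code path.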
OBSERVABILITY (the only reason this was diagnosable)
====================================================
The previous logger used fs.createWriteStream(LOG_FILE, {flags:'a'})
whose buffered .write() calls never reached disk on a worker-thread
crash. Every failed launch left a vault-backend.log that "ended" 2-7
seconds before the actual death point, and we spent two days chasing
wrong root causes (libusb segfault, semver throw, port collision)
because the actual exception line never made it to disk.
This commit replaces the buffered stream with fs.appendFileSync per
log call. Throughput hit is negligible at our log volume (~10-100
lines/sec at peak boot); the upside is that the log file is now a
faithful record of what executed up to a crash.
It also adds a structured boot environment dump immediately after
[Boot] Log file: …, capturing platform, arch, pid, ppid, cwd, argv,
stdio TTY status, PATH.length, LANG, LC_ALL, and Windows-only env
vars (USERNAME, SESSIONNAME, APPDATA, LOCALAPPDATA). The PATH.length
field is what surfaced this bug — Explorer launches show
PATH.length=0, terminal launches show PATH.length=882. Without that
single field, the only evidence was "splash hangs" which is
consistent with twenty different root causes.
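
As an illustration of why the `PATH.length` field is useful, here is a small sketch that formats only the `PATH.length` and `LANG` part of the dump. The function name `formatBootEnv` and the `Env` type are hypothetical; the output format follows the `[Boot] env:` line quoted in this PR's verification logs, and the remaining fields (platform, pid, and so on) are omitted.

```typescript
// Hypothetical shape: a plain snapshot of environment variables.
type Env = Record<string, string | undefined>;

// Formats the PATH/LANG portion of the boot dump. A missing PATH is
// reported as length 0, the same as an empty one.
function formatBootEnv(env: Env): string {
  const pathValue = env.PATH ?? "";
  const lang = env.LANG ?? "";
  return `[Boot] env: PATH.length=${pathValue.length} LANG=${lang}`;
}

// An Explorer-style launch with nothing inherited:
console.log(formatBootEnv({}));
```

In the real worker the argument would presumably be `process.env`. An empty `PATH` and a populated one then differ on one greppable line.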
Finally, engine-controller.ts gets boundary log lines around every
JS↔native transition in start() and fetchFirmwareManifest() —
usb.on(attach), usb.on(detach), usb.getDeviceList(), mergeManifests,
applyChannel, syncState — each wrapped in try/catch with an explicit
[Engine] FATAL: log so a future libusb segfault leaves a clear
breadcrumb instead of a silent process exit.
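
The boundary-logging pattern could look roughly like the wrapper below. This is a sketch under assumptions: the helper name `atBoundary`, the `begin`/`done` wording, and the choice to rethrow are all hypothetical; only the `[Engine] FATAL:` prefix comes from this commit message.

```typescript
// Hypothetical logger type; in the app this would be the sync file logger.
type BoundaryLog = (line: string) => void;

// Runs fn with a log line before and after, and a FATAL line if it throws.
function atBoundary<T>(label: string, log: BoundaryLog, fn: () => T): T {
  // Written before crossing into native code. If the process dies inside
  // fn (for example a segfault), this is the last line in the log.
  log(`[Engine] ${label}: begin`);
  try {
    const result = fn();
    log(`[Engine] ${label}: done`);
    return result;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`[Engine] FATAL: ${label} threw: ${reason}`);
    throw err; // assumption: the caller still decides how to recover
  }
}

// Usage with a stand-in for a native call:
const devices = atBoundary("usb.getDeviceList()", console.log, () => ["device-0"]);
console.log(devices.length);
```

A try/catch cannot intercept a native crash, so the `begin` line is the useful breadcrumb in that case, provided the logger writes synchronously.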
WHAT THIS DOES NOT FIX
======================
The three findings from docs/HANDOFF-1.2.14-WINDOWS-PAIR.md remain:
- Finding 1 (Invalid Version: vundefined.undefined.undefined) is
  addressed by keepkey/hdwallet#37 which still needs review/merge,
  followed by a submodule pointer bump here.
- Finding 2 (native crash on USB unplug during pairRawDevice) is
  separate from this PR's bug. With the sync logger landing, the
  next reproduction should leave the actual death cause in the log.
- Finding 3 (port-1646 collision splash hang) is already fixed in
  the official 1.2.14 rebuild from 2026-04-09, verified working.
See docs/HANDOFF-1.2.14-WIN10-WATCHDOG-CRASH.md for the full
diagnosis story, including the verification recipe.
BitHighlander added a commit that referenced this pull request on Apr 9, 2026:

Bump version to 1.2.15 and pin hdwallet submodule to master (includes keepkey/hdwallet#37: Features version validation).

Changes since v1.2.14 pre-release:
- fix(win10): skip POSIX heartbeat watchdog on Windows (#107)
- fix: sync file logger, crash logs now survive native exceptions (#107)
- fix: boot environment dump for launch-context diagnostics (#107)
- fix: JS↔native boundary logging in engine-controller (#107)
- fix: port 1646 collision check before window creation (#106)
- fix: USB detach race guard on WebUSB pair (#106)
- fix: hdwallet Features version validation (hdwallet#37)
- docs: handoff for 1.2.14 Windows pair failures (#105)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Fixes the dominant 1.2.14 Win10 first-launch crash and lands the observability foundation that made it possible to find.
- `startHeartbeatWatchdog()` returns early on `process.platform === 'win32'`. The previous code unconditionally spawned `bash -c '<POSIX shell script>'`, which fails with `UV_ENOENT` (-4058) on Win10 Explorer launches (empty PATH), causing an uncaught async exception in the worker that kills the entire app right around device pair time.
- Sync file logger (`fs.appendFileSync` per call) replaces the `createWriteStream` buffered writes. Every crashed launch lost the actual exception line, which is why this bug went undiagnosed for two days.
- Boot environment dump: `platform`, `pid`, `ppid`, `cwd`, `argv`, stdio TTY status, `PATH.length`, Windows env vars. The `PATH.length=0` field is what surfaced this bug.
- Boundary logging around the JS↔native transitions in `engine-controller.ts`.

Verified fixed in-place by patching the installed bundle and re-launching from the desktop icon. Same install crashed without the fix; with the fix, the boot reaches `[PERF] +3876ms: boot complete` and the device pairs cleanly.

The bug
`projects/keepkey-vault/src/bun/index.ts:4246` unconditionally spawns `bash -c '<heartbeat watchdog script>'` at module load. The watchdog is POSIX-only (uses `bash`, `kill -9`, `date +%s`, `sleep`, `cat`, `[ -f ]`) and could never have functioned on Windows even if `bash` were on PATH. It exists to defend against an FFI freeze in `kkemu confirm_helper` that only happens on macOS/Linux builds.

When the app is launched from Explorer / Start Menu / installer-Run-now on Win10, the bun worker inherits an empty `PATH` from the parent process. `Bun.spawn` delegates to libuv, which fails with `UV_ENOENT` when it can't locate `bash`, and the failure is delivered as an uncaught async exception in the worker thread several seconds after the spawn call site. The exception kills the entire app right around `[Engine] State → needs_pin`, and the user sees a splash that hangs and then disappears.

Reproduced twice across two distinct rebuilt 1.2.14 binaries (4f8ec1ba… and 2111ad61…).
Why it took two days
The previous file logger used `fs.createWriteStream(LOG_FILE, {flags:'a'})`, whose buffered `.write()` calls were never flushed when the worker thread crashed. Every failed launch left a `vault-backend.log` that appeared to end at `[Engine] Merged manifest`, which was 2-7 seconds before the real death point. We chased three wrong root causes (libusb segfault on detach, semver throw in initialize, port-1646 collision), every one of which was downstream of a buffered log losing the actual exception line.
The first commit on this branch swaps the buffered stream for `fs.appendFileSync`. The very next failed launch produced this final line in `vault-backend.log`, which is the smoking gun:
```
[2026-04-09T22:01:38.479Z] ERR: Uncaught exception in worker: {
"code": "ENOENT",
"path": "bash",
"errno": -4058
}
```
Combined with `[Boot] env: PATH.length=0 LANG=` from the new boot env dump, the diagnosis was a 30-second `grep` for `'bash'` in `src/bun/`.
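
For reference, a sync logger of the kind described above can be sketched in a few lines. This is an illustrative version, not the code from this PR: the `logLine` name and the temp-directory `LOG_FILE` path are assumptions, and the timestamp prefix simply mirrors the format seen in the log excerpts.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical location; the real LOG_FILE path is not shown in this PR.
const LOG_FILE = path.join(os.tmpdir(), "vault-backend-sketch.log");

// Appends one timestamped line. appendFileSync hands the data to the
// operating system before it returns, so nothing sits in a user-space
// buffer that would be lost if the process dies on the next statement.
function logLine(message: string): void {
  const line = `[${new Date().toISOString()}] ${message}\n`;
  fs.appendFileSync(LOG_FILE, line);
}

logLine("[Boot] Log file: " + LOG_FILE);
```

The trade-off is one blocking write per log call, which this PR judges negligible at the stated volume. The write is not an `fsync`, so it protects against a process crash, not a power loss.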
Verification
Before the fix (orig 1.2.14, Explorer launch with device plugged in)
```
[2026-04-09T22:01:38.472Z] [Engine] State → needs_pin
[2026-04-09T22:01:38.472Z] [Engine] Resolved BL hash fe98454e… → v2.1.4
[2026-04-09T22:01:38.479Z] ERR: Uncaught exception in worker: { code: 'ENOENT', path: 'bash', errno: -4058 }
[silence — process gone]
```
After the fix (same install, Explorer launch, same device)
```
[Boot] env: PATH.length=0 LANG= ← Explorer launch, empty PATH
[Vault] Heartbeat watchdog skipped on Windows (POSIX-only) ← the fix fired
[Engine] Merged manifest: latest=v7.10.0 beta=v7.14.0
[Engine] Firmware manifest (latest): fw=7.10.0 bl=2.1.4
[Engine] Scanning for WebUSB device...
[Engine] WebUSB device found, attempting pairRawDevice...
[Engine] Paired via WebUSB
[Engine] Initializing device...
[Engine] Features: { ... firmwareVersion: 7.14.0, ... }
[Engine] State → needs_pin
[PERF] +3876ms: boot complete ← clean boot
[Engine] PIN_REQUEST → type=current
[Engine] State → ready ← device paired and unlocked
```
What this PR does NOT fix
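The three Win10 findings from `docs/HANDOFF-1.2.14-WINDOWS-PAIR.md` remain:
- Finding 1 (`Invalid Version: vundefined.undefined.undefined`) is addressed by keepkey/hdwallet#37, which still needs review/merge, followed by a submodule pointer bump here.
- Finding 2 (native crash on USB unplug during `pairRawDevice`) is separate from this PR's bug. With the sync logger landing, the next reproduction should leave the actual death cause in the log.
- Finding 3 (port-1646 collision splash hang) is already fixed in the official 1.2.14 rebuild from 2026-04-09.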
Files
Test plan