Skip to content

fix(win10): skip POSIX heartbeat watchdog on Windows + sync logging foundation#107

Merged
BitHighlander merged 1 commit into
developfrom
feat/observability-foundation
Apr 9, 2026
Merged

fix(win10): skip POSIX heartbeat watchdog on Windows + sync logging foundation#107
BitHighlander merged 1 commit into
developfrom
feat/observability-foundation

Conversation

@BitHighlander
Copy link
Copy Markdown
Collaborator

Summary

Fixes the dominant 1.2.14 Win10 first-launch crash and lands the observability foundation that made it possible to find.

Change Why
startHeartbeatWatchdog() returns early on process.platform === 'win32' The watchdog spawns bash -c '<POSIX shell script>' which fails with UV_ENOENT (-4058) on Win10 Explorer launches (empty PATH), causing an uncaught async exception in the worker that kills the entire app right around device pair time.
Sync logger (fs.appendFileSync per call) The previous createWriteStream buffered writes — every crashed launch lost the actual exception line, which is why this bug went undiagnosed for two days.
Boot environment dump Captures platform, pid, ppid, cwd, argv, stdio TTY status, PATH.length, Windows env vars. The PATH.length=0 field is what surfaced this bug.
JS↔native boundary logging in engine-controller.ts Every USB / libusb call now logs entry/exit with try/catch, so future native crashes leave clear breadcrumbs.

Verified fixed in-place by patching the installed bundle and re-launching from the desktop icon. Same install crashed without the fix; with the fix, the boot reaches [PERF] +3876ms: boot complete and the device pairs cleanly.

The bug

`projects/keepkey-vault/src/bun/index.ts:4246` unconditionally spawns bash -c '<heartbeat watchdog script>' at module load. The watchdog is POSIX-only (uses bash, kill -9, date +%s, sleep, cat, [ -f ]) and could never have functioned on Windows even if bash were on PATH. It exists to defend against an FFI freeze in kkemu confirm_helper that only happens on macOS/Linux builds.

When the app is launched from Explorer / Start Menu / installer-Run-now on Win10, the bun worker inherits an empty PATH from the parent process. Bun.spawn delegates to libuv, which fails with UV_ENOENT when it can't locate bash, and the failure is delivered as an uncaught async exception in the worker thread several seconds after the spawn call site. The exception kills the entire app right around [Engine] State → needs_pin, and the user sees a splash that hangs and then disappears.

Reproduced twice across two distinct rebuilt 1.2.14 binaries:

  • `KeepKey-Vault-1.2.14-win-x64-setup.exe` SHA256 `4f8ec1ba…` (2026-04-08 first publish)
  • `KeepKey-Vault-1.2.14-win-x64-setup.exe` SHA256 `2111ad61…` (2026-04-09 rebuild with port-collision fix)

Why it took two days

The previous file logger used `fs.createWriteStream(LOG_FILE, {flags:'a'})` whose buffered `.write()` calls were never flushed when the worker thread crashed. Every failed launch left a `vault-backend.log` that appeared to end at `[Engine] Merged manifest` — actually 2-7 seconds before the actual death point. We chased three wrong root causes (libusb segfault on detach, semver throw in initialize, port-1646 collision) every one of which was downstream of a buffered log losing the actual exception line.

The first commit on this branch swaps the buffered stream for `fs.appendFileSync`. The very next failed launch produced this final line in `vault-backend.log`, which is the smoking gun:

```
[2026-04-09T22:01:38.479Z] ERR: Uncaught exception in worker: {
"code": "ENOENT",
"path": "bash",
"errno": -4058
}
```

Combined with `[Boot] env: PATH.length=0 LANG=` from the new boot env dump, the diagnosis was a 30-second `grep` for `'bash'` in `src/bun/`.

Verification

Before the fix (orig 1.2.14, Explorer launch with device plugged in)

```
[2026-04-09T22:01:38.472Z] [Engine] State → needs_pin
[2026-04-09T22:01:38.472Z] [Engine] Resolved BL hash fe98454e… → v2.1.4
[2026-04-09T22:01:38.479Z] ERR: Uncaught exception in worker: { code: 'ENOENT', path: 'bash', errno: -4058 }
[silence — process gone]
```

After the fix (same install, Explorer launch, same device)

```
[Boot] env: PATH.length=0 LANG= ← Explorer launch, empty PATH
[Vault] Heartbeat watchdog skipped on Windows (POSIX-only) ← the fix fired
[Engine] Merged manifest: latest=v7.10.0 beta=v7.14.0
[Engine] Firmware manifest (latest): fw=7.10.0 bl=2.1.4
[Engine] Scanning for WebUSB device...
[Engine] WebUSB device found, attempting pairRawDevice...
[Engine] Paired via WebUSB
[Engine] Initializing device...
[Engine] Features: { ... firmwareVersion: 7.14.0, ... }
[Engine] State → needs_pin
[PERF] +3876ms: boot complete ← clean boot
[Engine] PIN_REQUEST → type=current
[Engine] State → ready ← device paired and unlocked
```

What this PR does NOT fix

The three Win10 findings from `docs/HANDOFF-1.2.14-WINDOWS-PAIR.md` remain:

  1. Finding 1 (`Invalid Version: vundefined.undefined.undefined` in `KeepKeyHDWallet.initialize()`) — addressed by `keepkey/hdwallet#37` which still needs review/merge, followed by a submodule pointer bump here in a separate PR.
  2. Finding 2 (native crash on USB unplug during `pairRawDevice`) — separate bug. With the sync logger landing in this PR, the next reproduction should leave the actual death cause in the log.
  3. Finding 3 (port-1646 collision splash hang) — already fixed in the official 1.2.14 rebuild from 2026-04-09; verified working.

Files

  • `projects/keepkey-vault/src/bun/index.ts` — sync logger (~50 lines), boot env dump (~22 lines), watchdog Windows skip + try/catch (~10 lines), long comment block on the watchdog explaining the platform constraint.
  • `projects/keepkey-vault/src/bun/engine-controller.ts` — JS↔native boundary logs in `start()`, `fetchFirmwareManifest()`, and `applyChannel()`. Each native call wrapped in try/catch with explicit `[Engine] FATAL:` logging.
  • `docs/HANDOFF-1.2.14-WIN10-WATCHDOG-CRASH.md` — full diagnosis story, verification recipe, and reasoning for keeping the observability changes regardless of the watchdog fix.

Test plan

  • Manual: install 1.2.14, launch from desktop icon (broken context), verify crash with old code
  • Manual: apply this PR's patches in-place to the installed bundle, re-launch from desktop icon, verify boot reaches `boot complete` and device pairs
  • Manual: launch from terminal, verify logger captures full session (still works)
  • Manual: backend log first lines contain `[Boot] platform=win32 …` env dump
  • CI: existing test suite (no test changes in this PR)
  • Reviewer: skim the watchdog comment block — push back if any of the platform-constraint claims are wrong (kill -9 on Windows, date +%s availability, etc)

…ation

THE BUG
=======

projects/keepkey-vault/src/bun/index.ts:4246 unconditionally spawned
`bash -c '<heartbeat watchdog script>'` at module load. On Win10, when
the app is launched from Explorer / Start Menu / installer Run-now,
the bun worker inherits an empty PATH from the parent process.
Bun.spawn delegates to libuv, which fails with UV_ENOENT (-4058) when
it can't locate `bash`, and the failure is delivered as an UNCAUGHT
asynchronous exception in the worker thread several seconds after the
spawn call site. The exception kills the entire app right around the
device pair flow, and the user sees a splash that hangs and then
disappears.

The watchdog is POSIX-only — its script uses bash, kill -9, date +%s,
sleep, cat — and could never have functioned on Windows even if `bash`
were on PATH. It exists to defend against an FFI freeze in
kkemu confirm_helper that only happens on macOS/Linux builds.

This was the dominant Win10 1.2.14 first-launch crash. Reproduced
twice across two different rebuilt 1.2.14 binaries (4f8ec1ba… and
2111ad61…), and verified fixed in-place by patching the installed
bundle and re-launching from the desktop icon.

THE FIX
=======

1. startHeartbeatWatchdog() returns early on process.platform === 'win32'
   with a [Vault] Heartbeat watchdog skipped on Windows log line.
2. The Bun.spawn(['bash', ...]) call is wrapped in try/catch as
   defense-in-depth on POSIX hosts where bash could conceivably be
   missing (containers, minimal NixOS, etc).
3. A long comment block above the function explains the platform
   constraint and references this incident, so the next person to
   touch the watchdog doesn't accidentally re-enable it on Windows.

OBSERVABILITY (the only reason this was diagnosable)
====================================================

The previous logger used fs.createWriteStream(LOG_FILE, {flags:'a'})
whose buffered .write() calls never reached disk on a worker-thread
crash. Every failed launch left a vault-backend.log that "ended" 2-7
seconds before the actual death point, and we spent two days chasing
wrong root causes (libusb segfault, semver throw, port collision)
because the actual exception line never made it to disk.

This commit replaces the buffered stream with fs.appendFileSync per
log call. Throughput hit is negligible at our log volume (~10-100
lines/sec at peak boot); the upside is that the log file is now a
faithful record of what executed up to a crash.

It also adds a structured boot environment dump immediately after
[Boot] Log file: …, capturing platform, arch, pid, ppid, cwd, argv,
stdio TTY status, PATH.length, LANG, LC_ALL, and Windows-only env
vars (USERNAME, SESSIONNAME, APPDATA, LOCALAPPDATA). The PATH.length
field is what surfaced this bug — Explorer launches show
PATH.length=0, terminal launches show PATH.length=882. Without that
single field, the only evidence was "splash hangs" which is
consistent with twenty different root causes.

Finally, engine-controller.ts gets boundary log lines around every
JS↔native transition in start() and fetchFirmwareManifest() —
usb.on(attach), usb.on(detach), usb.getDeviceList(), mergeManifests,
applyChannel, syncState — each wrapped in try/catch with an explicit
[Engine] FATAL: log so a future libusb segfault leaves a clear
breadcrumb instead of a silent process exit.

WHAT THIS DOES NOT FIX
======================

The three findings from docs/HANDOFF-1.2.14-WINDOWS-PAIR.md remain:

- Finding 1 (Invalid Version: vundefined.undefined.undefined) is
  addressed by keepkey/hdwallet#37 which still needs review/merge,
  followed by a submodule pointer bump here.
- Finding 2 (native crash on USB unplug during pairRawDevice) is
  separate from this PR's bug. With the sync logger landing, the
  next reproduction should leave the actual death cause in the log.
- Finding 3 (port-1646 collision splash hang) is already fixed in
  the official 1.2.14 rebuild from 2026-04-09 — verified working.

See docs/HANDOFF-1.2.14-WIN10-WATCHDOG-CRASH.md for the full
diagnosis story, including the verification recipe.
@BitHighlander BitHighlander merged commit 0b12475 into develop Apr 9, 2026
2 checks passed
BitHighlander added a commit that referenced this pull request Apr 9, 2026
Bump version to 1.2.15 and pin hdwallet submodule to master (includes
keepkey/hdwallet#37 — Features version validation).

Changes since v1.2.14 pre-release:
- fix(win10): skip POSIX heartbeat watchdog on Windows (#107)
- fix: sync file logger — crash logs now survive native exceptions (#107)
- fix: boot environment dump for launch-context diagnostics (#107)
- fix: JS↔native boundary logging in engine-controller (#107)
- fix: port 1646 collision check before window creation (#106)
- fix: USB detach race guard on WebUSB pair (#106)
- fix: hdwallet Features version validation (hdwallet#37)
- docs: handoff for 1.2.14 Windows pair failures (#105)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant