feat(controller): pair with controller for channel/device pairing by evanjacobson · Pull Request #1095 · Kilo-Org/cloud

evanjacobson · 2026-03-13T23:09:54Z

Summary

Adds a controller-side pairing cache and HTTP API so the Durable Object can resolve pairing requests via the controller's in-memory cache instead of shelling out via fly exec every time.

Pairing cache on the controller periodically runs kilo pairing list / kilo device-pairing list, parses stdout, and serves cached results. Refreshes reactively when the gateway logs a new pairing event (via a new onStdoutLine supervisor hook).
HTTP routes at /_kilo/pairing/{channels,devices} and /_kilo/pairing/{channels,devices}/approve.
DO pairing methods now try the controller HTTP API first and fall back to fly exec on unknown-route errors, preserving backward compatibility with older controller versions.
Zod schemas for typed controller pairing responses.
409 machine recovery: if createMachine hits a 409 "already_exists", adopts the existing machine instead of crashing. fix(kiloclaw): recover from 409 already_exists in createNewMachine #1135
Volume existence check after machine ID recovery to prevent startups against a missing volume.

Also included: billing infra hardening

These files are coupled to call sites already on main and cannot be split out without breaking the build:

stripe-handlers.ts — Stripe webhook handlers (subscription.created, .updated, .deleted, schedule events). Minor hardening: null-check on subscription items, improved comment accuracy. Known gap: subscription.updated handler reads suspended_at outside a transaction (TOCTOU with billing cron — tracked separately).
billing-lifecycle-cron.ts — Trial/earlybird/subscription expiry sweeps and email notifications. Hardening: added isNull(suspended_at) guard on earlybird sweep, wrapped email-log delete in try/catch. Known gap: sweep UPDATEs use bare WHERE user_id = ? without conditional status guards (TOCTOU with webhooks — tracked separately).
kiloclaw-router.ts — Billing endpoints (getBillingStatus, createSubscriptionCheckout, cancelSubscription, switchPlan, etc.) and baseProcedure → clawAccessProcedure gating on existing endpoints.
modelSupport.ts — Minor refactor collapsing early-return conditions.

Verification

Pairing cache tests (749 lines) — pairing-cache.test.ts
Pairing route tests (410 lines) — routes/pairing.test.ts
Supervisor onStdoutLine tests (135 lines) — supervisor.test.ts
Manual testing — paired Telegram channel

Visual Changes

N/A

Reviewer Notes

The controller-first / fly-exec-fallback pattern uses isErrorUnknownRoute to detect older controllers that don't have the pairing routes, so rollout is safe without version gating.
The pairing cache debounces rapid refresh triggers and caps concurrent CLI invocations to avoid overwhelming the gateway process.
Cache warmup is fire-and-forget — start() kicks off the initial refresh but does not block server.listen(). This avoids a slow or wedged OpenClaw CLI (45s timeout per path) holding the health endpoint unreachable past the DO's 60s startup probe. The brief empty-cache window is acceptable: the DO falls back to KV / fly exec, and the cache self-heals via periodic refresh and log-triggered debounce.
The billing files carry two known TOCTOU race conditions (documented in cloud_todos.md) that are pre-existing and tracked for separate follow-up — not introduced or worsened by this PR.

…iented stdout monitoring

…and initial fetch

…ng, reduce type assertions

…vice pairing

…ly exec

…ntroller startup/shutdown

- Add stopped guards to refresh methods preventing post-cleanup work - Replace unsafe type casts with isRecord() helper in route handlers - Wrap KV delete calls in try-catch to prevent cache invalidation failures from masking successful approvals - Fix cross-reference comments and JSDoc accuracy - Add tests covering empty-field filtering, post-cleanup behavior, detectChannels edge cases, start idempotency, post-approve refresh failures, and DO-level approve/refresh fallbacks

When a DO has userId/sandboxId but lost its flyMachineId (e.g. after a DO migration or state reset), the start() flow would try to create a new machine and hit a 409 "already_exists" from Fly. Two fixes: 1. Run attemptMetadataRecovery in start() when flyMachineId is missing, not just during full Postgres restore. 2. Safety net: if createNewMachine gets a 409 "already_exists", extract the machine ID from the error body and adopt it via startExistingMachine.

The volume verification was guarded by !flyMachineId, so it was skipped after metadata recovery set the machine ID. A stale flyVolumeId would then be passed to updateMachine, causing a 400 "volume does not exist" error.

kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts

kilo-code-bot · 2026-03-13T23:14:21Z

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Overview

Severity	Count
CRITICAL	0
WARNING	2
SUGGESTION	0

Fix these issues in Kilo Cloud

Issue Details (click to expand)

WARNING

File	Line	Issue
`kiloclaw/src/durable-objects/kiloclaw-instance/pairing.ts`	160	Controller-side channel approvals skip legacy KV invalidation, so a later route-unavailable fallback can resurrect approved requests from stale `pairing:*` data.
`kiloclaw/src/durable-objects/kiloclaw-instance/pairing.ts`	329	Controller-side device approvals skip legacy KV invalidation, so a later route-unavailable fallback can serve stale approved device requests from `device-pairing:*` data.

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

N/A

Files Reviewed (17 files)

kiloclaw/controller/src/index.ts - 0 issues
kiloclaw/controller/src/pairing-cache.test.ts - 0 issues
kiloclaw/controller/src/pairing-cache.ts - 0 issues
kiloclaw/controller/src/routes/pairing.test.ts - 0 issues
kiloclaw/controller/src/routes/pairing.ts - 0 issues
kiloclaw/controller/src/supervisor.test.ts - 0 issues
kiloclaw/controller/src/supervisor.ts - 0 issues
kiloclaw/src/durable-objects/gateway-controller-types.ts - 0 issues
kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues
kiloclaw/src/durable-objects/kiloclaw-instance/gateway.ts - 0 issues
kiloclaw/src/durable-objects/kiloclaw-instance/index.ts - 0 issues
kiloclaw/src/durable-objects/kiloclaw-instance/pairing.ts - 2 issues
kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts - 0 issues
src/app/(app)/claw/components/modelSupport.ts - 0 issues
src/lib/kiloclaw/billing-lifecycle-cron.ts - 0 issues
src/lib/kiloclaw/stripe-handlers.ts - 0 issues
src/routers/kiloclaw-router.ts - 0 issues

_{Reviewed by gpt-5.4-20260305 · 2,133,947 tokens}

…region check

Resolved conflicts: - controller/src/index.ts: combined pairing + gmail-watch/env features - scripts/.dev-start.conf.example: took main's updated format + new env var - src/routes/platform.ts: added Gmail safe error prefix from main - Fixed supervisor.test.ts: gatewayArgs → args to match renamed option

…w-pair-with-controller

kiloclaw/controller/src/pairing-cache.ts

…emoved refreshChannelPairing() returned early when detectChannels() yielded no channels, leaving the previous channelCache intact. Users who removed their last channel token would still see old pending pairing requests until the controller restarted.

…delay New controllers returned empty pairing queues for the first 5 seconds because start() scheduled the initial refreshAll behind a setTimeout.

…hine-recovery) The 409 already_exists recovery in createNewMachine has been extracted to a dedicated branch with proper logging and test coverage.

… logging - Replace manual PairingRequest/DevicePairingRequest types with z.infer from the shared Zod schemas, reducing type duplication - Replace generic parseCachedRequests with schema-validated parsers - Extract regex constants (CHANNEL_RE, CODE_RE, UUID_RE) for reuse - Include exit_code, stderr/stdout, and parse errors in log messages

- Remove try/catch around post-approve refreshes (let errors propagate) - Remove .catch wrappers around refreshAll calls (Promise.allSettled handles errors internally per-refresh) - Add WARNING log prefix when a previously-successful channel fails - Log when all channels fail and cache is not updated - Export isRecord helper to avoid duplication in routes - Remove unnecessary `as const` assertions in factory functions

…ponses - Remove try/catch around refresh calls — errors now surface as 500s instead of silently returning stale cached data - Remove duplicate isRecord (use export from pairing-cache) - Remove redundant `error` field from error responses (message suffices) - Simplify approveResponse to always use { success, message }

Prevent approveChannel/approveDevice from shelling out to the CLI after cleanup() has been called.

- Log controller version check failures instead of empty catch block - Log non-404 getVolume errors that were silently dropped

kiloclaw/controller/src/index.ts

Reverts the awaited start() from 7b3ca65. That commit addressed the "empty cache on fresh startup" review comment, but it introduced a worse problem: each CLI path has a 45s timeout, so a slow or wedged OpenClaw CLI can hold server.listen() long enough for the DO's 60s health probe to fail — turning a non-critical cache miss into a full machine startup failure. The empty-cache window is acceptable because: - The DO-side fallback chain (controller → KV → fly exec) handles it - The cache self-heals within seconds via periodic refresh and log-triggered debounce - Callers can pass ?refresh=true for on-demand population

…lo-Org/cloud into feature/kiloclaw-pair-with-controller

kiloclaw/src/durable-objects/kiloclaw-instance/index.ts

…st-path When attemptMetadataRecovery() finds a started machine and sets status to 'running', the start() fast-path verified the machine was alive and returned early — but skipped scheduleAlarm(). This left recovered instances without health checks or self-healing until a later state transition happened to reschedule one.

- Signal handlers now catch onSignal() rejections and force exit(1) instead of silently hanging the process - trySendEmail: wrap DB delete in nested try/catch so a delete failure no longer shadows the original send error - Earlybird warning query: exclude suspended users via isNull(suspended_at) - reconcileApiKeyExpiry: use finally to clear mintTimeoutId, avoiding use-before-assignment and ensuring timeout is always cancelled - reconcileMachineMount: move flyMachineId null guard before the log call that references it - getSubscriptionPeriods: log a warning when items.data is empty - getGatewayToken: log warning instead of swallowing error silently - pairing.ts: invert empty-if fall-through pattern to if (!cond) throw - pairing.ts: add sandboxId/appId context to channel list error logs - pairing-cache.ts: remove always-true else-if condition and redundant comments

…t coverage - Fix reconcileStarting stuck-in-starting when transient Fly API errors persist past STARTING_TIMEOUT_MS — now transitions to 'stopped' - Preserve stale pairing-cache data for channels whose CLI refresh fails instead of silently dropping their requests - Add console.warn breadcrumbs in DO pairing catch blocks before re-throwing non-route controller errors - Simplify error extraction in callGatewayController (hoist bodyObj, replace nested ternary) - Add 13 new tests covering: timeout-vs-transient reconciliation, stale-cache preservation, approve-after-cleanup guards, KV cache intermediate fallback, throwing onStdoutLine resilience, and non-running pairing guards

kiloclaw/controller/src/pairing-cache.ts

kiloclaw/src/durable-objects/kiloclaw-instance/index.ts

- Add generation counters to pairing cache refreshes to prevent stale overwrites from concurrent refresh calls (timer/debounce/request) - Guard start() against zombie resurrection after destroy()+deleteAll() by also aborting on undefined storage status - Add startInProgress flag to prevent concurrent start() calls from creating duplicate Fly machines via waitUntil + RPC interleaving

…lo-Org/cloud into feature/kiloclaw-pair-with-controller

evanjacobson · 2026-03-17T05:48:23Z

src/lib/kiloclaw/billing-lifecycle-cron.ts

-      .delete(kiloclaw_email_log)
-      .where(
-        and(eq(kiloclaw_email_log.user_id, userId), eq(kiloclaw_email_log.email_type, emailType))
+    try {


if the delete also fails (transient DB error), the original send error must still propagate — without this, a delete failure masks the real error and the sweep iteration crashes on the wrong exception

evanjacobson · 2026-03-17T05:48:49Z

src/lib/kiloclaw/billing-lifecycle-cron.ts

      )
      .where(
-        sql`(${kiloclaw_subscriptions.status} IS NULL OR ${kiloclaw_subscriptions.status} NOT IN ('active', 'trialing'))`
+        and(


without this guard, the sweep sends "earlybird ending" emails to users who are already locked out by a prior cron run or webhook suspension.

evanjacobson · 2026-03-17T05:49:13Z

src/lib/kiloclaw/stripe-handlers.ts


 function getSubscriptionPeriods(subscription: Stripe.Subscription) {
-  // In the current Stripe API, period timestamps live on the subscription item, not the subscription.
+  // Stripe moved period timestamps to the item level (not the top-level subscription object).


Stripe subscriptions should always have at least one item, but log if not so we can investigate rather than silently returning nulls

kiloclaw/src/durable-objects/kiloclaw-instance/pairing.ts

evanjacobson · 2026-03-17T15:13:57Z

Related: #1135

evanjacobson · 2026-03-17T15:15:32Z

Spec: https://github.com/orgs/Kilo-Org/projects/30/views/1?pane=issue&itemId=164917045

jeanduplessis · 2026-03-17T16:25:59Z

src/lib/kiloclaw/stripe-handlers.ts

+  // Stripe moved period timestamps to the item level (not the top-level subscription object).
  const item = subscription.items.data[0];
+  if (!item) {
+    console.warn('[stripe] Subscription has no items:', subscription.id);


Suggest using logWarning here

jeanduplessis · 2026-03-17T16:26:22Z

src/lib/kiloclaw/billing-lifecycle-cron.ts

+          and(eq(kiloclaw_email_log.user_id, userId), eq(kiloclaw_email_log.email_type, emailType))
+        );
+    } catch (deleteError) {
+      console.error(


Suggest using logError here

jeanduplessis

Giving approval specifically for billing changes. We should get @pandemicsyn's final approval for the pairing work.

evanjacobson added 14 commits March 13, 2026 16:45

Add dev-start script and config example

52c1065

feat(worker): add Zod schemas for controller pairing responses

c5cd564

feat(controller): add onStdoutLine callback to supervisor for line-or…

4b41cc2

…iented stdout monitoring

feat(controller): add pairing cache with periodic refresh, debounce, …

8a4f70f

…and initial fetch

fix(controller): pairing cache - use full CLI path, add failure loggi…

c90cffe

…ng, reduce type assertions

feat(controller): add /_kilo/pairing/* HTTP routes for channel and de…

f5f0099

…vice pairing

feat(worker): DO pairing methods try controller first, fall back to f…

29c1787

…ly exec

feat(controller): wire pairing cache, routes, and stdout hook into co…

22de0f0

…ntroller startup/shutdown

style: auto-format

a5f650e

fix: verify volume exists even after machine ID recovery

ca546d1

The volume verification was guarded by !flyMachineId, so it was skipped after metadata recovery set the machine ID. A stale flyVolumeId would then be passed to updateMachine, causing a 400 "volume does not exist" error.

Undo other branch's changes

ba68b5c

Auto-format and undo some changes not intended for this branch

f53e37c

kilo-code-bot bot reviewed Mar 13, 2026

View reviewed changes

kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts Outdated Show resolved Hide resolved

evanjacobson added 3 commits March 16, 2026 08:26

test(kiloclaw): fix test isolation for start() metadata recovery and …

f5caf87

…region check

Merge branch 'main' of github.com:Kilo-Org/cloud into feature/kilocla…

0d0569a

…w-pair-with-controller

kilo-code-bot bot reviewed Mar 16, 2026

View reviewed changes

kiloclaw/controller/src/pairing-cache.ts Outdated Show resolved Hide resolved

kiloclaw/controller/src/pairing-cache.ts Outdated Show resolved Hide resolved

evanjacobson added 10 commits March 16, 2026 11:11

fix(pairing-cache): refresh immediately on start instead of after 5s …

6e6d18a

…delay New controllers returned empty pairing queues for the first 5 seconds because start() scheduled the initial refreshAll behind a setTimeout.

refactor(kiloclaw): remove 409 recovery (moved to improvement/409-mac…

35c05c2

…hine-recovery) The 409 already_exists recovery in createNewMachine has been extracted to a dedicated branch with proper logging and test coverage.

docs(supervisor): clarify clean-exit respawn comment

39b92f9

fix(pairing-cache): add stopped guard to approve methods

616b3b4

Prevent approveChannel/approveDevice from shelling out to the CLI after cleanup() has been called.

fix(reconcile): add logging for silently swallowed errors

45d970e

- Log controller version check failures instead of empty catch block - Log non-404 getVolume errors that were silently dropped

(chore) run formatter

90eb28b

Merge branch 'main' into feature/kiloclaw-pair-with-controller

625031f

kilo-code-bot bot reviewed Mar 16, 2026

View reviewed changes

kiloclaw/controller/src/index.ts Outdated Show resolved Hide resolved

evanjacobson added 3 commits March 16, 2026 14:25

Merge branch 'feature/kiloclaw-pair-with-controller' of github.com:Ki…

e1127fc

…lo-Org/cloud into feature/kiloclaw-pair-with-controller

fix(merge): resolve conflict in kiloclaw-instance test after main merge

a14717f

kilo-code-bot bot reviewed Mar 16, 2026

View reviewed changes

kiloclaw/src/durable-objects/kiloclaw-instance/index.ts Outdated Show resolved Hide resolved

evanjacobson added 4 commits March 16, 2026 16:48

Merge branch 'main' into feature/kiloclaw-pair-with-controller

f735e87

evanjacobson enabled auto-merge March 17, 2026 03:20

kilo-code-bot bot reviewed Mar 17, 2026

View reviewed changes

kiloclaw/controller/src/pairing-cache.ts Show resolved Hide resolved

kiloclaw/src/durable-objects/kiloclaw-instance/index.ts Show resolved Hide resolved

evanjacobson added 3 commits March 16, 2026 22:56

Merge branch 'feature/kiloclaw-pair-with-controller' of github.com:Ki…

65a761f

…lo-Org/cloud into feature/kiloclaw-pair-with-controller

Remove pre-commit hook

f875a89

evanjacobson commented Mar 17, 2026

View reviewed changes

jeanduplessis requested review from jeanduplessis and pandemicsyn March 17, 2026 13:30

Merge branch 'main' into feature/kiloclaw-pair-with-controller

5b65cb9

kilo-code-bot bot reviewed Mar 17, 2026

View reviewed changes

kiloclaw/src/durable-objects/kiloclaw-instance/pairing.ts Show resolved Hide resolved

kiloclaw/src/durable-objects/kiloclaw-instance/pairing.ts Show resolved Hide resolved

jeanduplessis reviewed Mar 17, 2026

View reviewed changes

jeanduplessis approved these changes Mar 17, 2026

View reviewed changes

evanjacobson merged commit aac35a5 into main Mar 17, 2026
18 checks passed

evanjacobson deleted the feature/kiloclaw-pair-with-controller branch March 17, 2026 16:27

evanjacobson mentioned this pull request Mar 17, 2026

Revert "feat(controller): pair with controller for channel/device pairing" #1179

Closed

Conversation

evanjacobson commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Also included: billing infra hardening

Verification

Visual Changes

Reviewer Notes

Uh oh!

Uh oh!

kilo-code-bot bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Summary

Overview

WARNING

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

evanjacobson Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

evanjacobson Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

evanjacobson Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

evanjacobson commented Mar 17, 2026

Uh oh!

evanjacobson commented Mar 17, 2026

Uh oh!

jeanduplessis Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

jeanduplessis Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

jeanduplessis left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

evanjacobson commented Mar 13, 2026 •

edited

Loading

kilo-code-bot bot commented Mar 13, 2026 •

edited

Loading