Skip to content

feat(controller): pair with controller for channel/device pairing#1095

Merged
evanjacobson merged 44 commits intomainfrom
feature/kiloclaw-pair-with-controller
Mar 17, 2026
Merged

feat(controller): pair with controller for channel/device pairing#1095
evanjacobson merged 44 commits intomainfrom
feature/kiloclaw-pair-with-controller

Conversation

@evanjacobson
Copy link
Contributor

@evanjacobson evanjacobson commented Mar 13, 2026

Summary

Adds a controller-side pairing cache and HTTP API so the Durable Object can resolve pairing requests via the controller's in-memory cache instead of shelling out via fly exec every time.

  • Pairing cache on the controller periodically runs kilo pairing list / kilo device-pairing list, parses stdout, and serves cached results. Refreshes reactively when the gateway logs a new pairing event (via a new onStdoutLine supervisor hook).
  • HTTP routes at /_kilo/pairing/{channels,devices} and /_kilo/pairing/{channels,devices}/approve.
  • DO pairing methods now try the controller HTTP API first and fall back to fly exec on unknown-route errors, preserving backward compatibility with older controller versions.
  • Zod schemas for typed controller pairing responses.
  • 409 machine recovery: if createMachine hits a 409 "already_exists", adopts the existing machine instead of crashing. fix(kiloclaw): recover from 409 already_exists in createNewMachine #1135
  • Volume existence check after machine ID recovery to prevent startups against a missing volume.

Also included: billing infra hardening

These files are coupled to call sites already on main and cannot be split out without breaking the build:

  • stripe-handlers.ts — Stripe webhook handlers (subscription.created, .updated, .deleted, schedule events). Minor hardening: null-check on subscription items, improved comment accuracy. Known gap: subscription.updated handler reads suspended_at outside a transaction (TOCTOU with billing cron — tracked separately).
  • billing-lifecycle-cron.ts — Trial/earlybird/subscription expiry sweeps and email notifications. Hardening: added isNull(suspended_at) guard on earlybird sweep, wrapped email-log delete in try/catch. Known gap: sweep UPDATEs use bare WHERE user_id = ? without conditional status guards (TOCTOU with webhooks — tracked separately).
  • kiloclaw-router.ts — Billing endpoints (getBillingStatus, createSubscriptionCheckout, cancelSubscription, switchPlan, etc.) and baseProcedureclawAccessProcedure gating on existing endpoints.
  • modelSupport.ts — Minor refactor collapsing early-return conditions.

Verification

  • Pairing cache tests (749 lines) — pairing-cache.test.ts
  • Pairing route tests (410 lines) — routes/pairing.test.ts
  • Supervisor onStdoutLine tests (135 lines) — supervisor.test.ts
  • Manual testing — paired Telegram channel

Visual Changes

N/A

Reviewer Notes

  • The controller-first / fly-exec-fallback pattern uses isErrorUnknownRoute to detect older controllers that don't have the pairing routes, so rollout is safe without version gating.
  • The pairing cache debounces rapid refresh triggers and caps concurrent CLI invocations to avoid overwhelming the gateway process.
  • Cache warmup is fire-and-forget — start() kicks off the initial refresh but does not block server.listen(). This avoids a slow or wedged OpenClaw CLI (45s timeout per path) holding the health endpoint unreachable past the DO's 60s startup probe. The brief empty-cache window is acceptable: the DO falls back to KV / fly exec, and the cache self-heals via periodic refresh and log-triggered debounce.
  • The billing files carry two known TOCTOU race conditions (documented in cloud_todos.md) that are pre-existing and tracked for separate follow-up — not introduced or worsened by this PR.

- Add stopped guards to refresh methods preventing post-cleanup work
- Replace unsafe type casts with isRecord() helper in route handlers
- Wrap KV delete calls in try-catch to prevent cache invalidation failures
  from masking successful approvals
- Fix cross-reference comments and JSDoc accuracy
- Add tests covering empty-field filtering, post-cleanup behavior,
  detectChannels edge cases, start idempotency, post-approve refresh
  failures, and DO-level approve/refresh fallbacks
When a DO has userId/sandboxId but lost its flyMachineId (e.g. after
a DO migration or state reset), the start() flow would try to create
a new machine and hit a 409 "already_exists" from Fly.

Two fixes:
1. Run attemptMetadataRecovery in start() when flyMachineId is missing,
   not just during full Postgres restore.
2. Safety net: if createNewMachine gets a 409 "already_exists", extract
   the machine ID from the error body and adopt it via startExistingMachine.
The volume verification was guarded by !flyMachineId, so it was
skipped after metadata recovery set the machine ID. A stale
flyVolumeId would then be passed to updateMachine, causing a 400
"volume does not exist" error.
@kilo-code-bot
Copy link
Contributor

kilo-code-bot bot commented Mar 13, 2026

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 2
SUGGESTION 0

Fix these issues in Kilo Cloud

Issue Details (click to expand)

WARNING

File Line Issue
kiloclaw/src/durable-objects/kiloclaw-instance/pairing.ts 160 Controller-side channel approvals skip legacy KV invalidation, so a later route-unavailable fallback can resurrect approved requests from stale pairing:* data.
kiloclaw/src/durable-objects/kiloclaw-instance/pairing.ts 329 Controller-side device approvals skip legacy KV invalidation, so a later route-unavailable fallback can serve stale approved device requests from device-pairing:* data.
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

N/A

Files Reviewed (17 files)
  • kiloclaw/controller/src/index.ts - 0 issues
  • kiloclaw/controller/src/pairing-cache.test.ts - 0 issues
  • kiloclaw/controller/src/pairing-cache.ts - 0 issues
  • kiloclaw/controller/src/routes/pairing.test.ts - 0 issues
  • kiloclaw/controller/src/routes/pairing.ts - 0 issues
  • kiloclaw/controller/src/supervisor.test.ts - 0 issues
  • kiloclaw/controller/src/supervisor.ts - 0 issues
  • kiloclaw/src/durable-objects/gateway-controller-types.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/gateway.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/index.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/pairing.ts - 2 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts - 0 issues
  • src/app/(app)/claw/components/modelSupport.ts - 0 issues
  • src/lib/kiloclaw/billing-lifecycle-cron.ts - 0 issues
  • src/lib/kiloclaw/stripe-handlers.ts - 0 issues
  • src/routers/kiloclaw-router.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 2,133,947 tokens

Resolved conflicts:
- controller/src/index.ts: combined pairing + gmail-watch/env features
- scripts/.dev-start.conf.example: took main's updated format + new env var
- src/routes/platform.ts: added Gmail safe error prefix from main
- Fixed supervisor.test.ts: gatewayArgs → args to match renamed option
…emoved

refreshChannelPairing() returned early when detectChannels() yielded no
channels, leaving the previous channelCache intact. Users who removed
their last channel token would still see old pending pairing requests
until the controller restarted.
…delay

New controllers returned empty pairing queues for the first 5 seconds
because start() scheduled the initial refreshAll behind a setTimeout.
…hine-recovery)

The 409 already_exists recovery in createNewMachine has been extracted
to a dedicated branch with proper logging and test coverage.
… logging

- Replace manual PairingRequest/DevicePairingRequest types with z.infer
  from the shared Zod schemas, reducing type duplication
- Replace generic parseCachedRequests with schema-validated parsers
- Extract regex constants (CHANNEL_RE, CODE_RE, UUID_RE) for reuse
- Include exit_code, stderr/stdout, and parse errors in log messages
- Remove try/catch around post-approve refreshes (let errors propagate)
- Remove .catch wrappers around refreshAll calls (Promise.allSettled
  handles errors internally per-refresh)
- Add WARNING log prefix when a previously-successful channel fails
- Log when all channels fail and cache is not updated
- Export isRecord helper to avoid duplication in routes
- Remove unnecessary `as const` assertions in factory functions
…ponses

- Remove try/catch around refresh calls — errors now surface as 500s
  instead of silently returning stale cached data
- Remove duplicate isRecord (use export from pairing-cache)
- Remove redundant `error` field from error responses (message suffices)
- Simplify approveResponse to always use { success, message }
Prevent approveChannel/approveDevice from shelling out to the CLI
after cleanup() has been called.
- Log controller version check failures instead of empty catch block
- Log non-404 getVolume errors that were silently dropped
Reverts the awaited start() from 7b3ca65. That commit addressed the
"empty cache on fresh startup" review comment, but it introduced a worse
problem: each CLI path has a 45s timeout, so a slow or wedged OpenClaw
CLI can hold server.listen() long enough for the DO's 60s health probe
to fail — turning a non-critical cache miss into a full machine startup
failure.

The empty-cache window is acceptable because:
- The DO-side fallback chain (controller → KV → fly exec) handles it
- The cache self-heals within seconds via periodic refresh and
  log-triggered debounce
- Callers can pass ?refresh=true for on-demand population
…lo-Org/cloud into feature/kiloclaw-pair-with-controller
…st-path

When attemptMetadataRecovery() finds a started machine and sets status
to 'running', the start() fast-path verified the machine was alive and
returned early — but skipped scheduleAlarm(). This left recovered
instances without health checks or self-healing until a later state
transition happened to reschedule one.
- Signal handlers now catch onSignal() rejections and force exit(1)
  instead of silently hanging the process
- trySendEmail: wrap DB delete in nested try/catch so a delete failure
  no longer shadows the original send error
- Earlybird warning query: exclude suspended users via isNull(suspended_at)
- reconcileApiKeyExpiry: use finally to clear mintTimeoutId, avoiding
  use-before-assignment and ensuring timeout is always cancelled
- reconcileMachineMount: move flyMachineId null guard before the log
  call that references it
- getSubscriptionPeriods: log a warning when items.data is empty
- getGatewayToken: log warning instead of swallowing error silently
- pairing.ts: invert empty-if fall-through pattern to if (!cond) throw
- pairing.ts: add sandboxId/appId context to channel list error logs
- pairing-cache.ts: remove always-true else-if condition and redundant comments
…t coverage

- Fix reconcileStarting stuck-in-starting when transient Fly API errors
  persist past STARTING_TIMEOUT_MS — now transitions to 'stopped'
- Preserve stale pairing-cache data for channels whose CLI refresh fails
  instead of silently dropping their requests
- Add console.warn breadcrumbs in DO pairing catch blocks before
  re-throwing non-route controller errors
- Simplify error extraction in callGatewayController (hoist bodyObj,
  replace nested ternary)
- Add 13 new tests covering: timeout-vs-transient reconciliation,
  stale-cache preservation, approve-after-cleanup guards, KV cache
  intermediate fallback, throwing onStdoutLine resilience, and
  non-running pairing guards
@evanjacobson evanjacobson enabled auto-merge March 17, 2026 03:20
- Add generation counters to pairing cache refreshes to prevent stale
  overwrites from concurrent refresh calls (timer/debounce/request)
- Guard start() against zombie resurrection after destroy()+deleteAll()
  by also aborting on undefined storage status
- Add startInProgress flag to prevent concurrent start() calls from
  creating duplicate Fly machines via waitUntil + RPC interleaving
…lo-Org/cloud into feature/kiloclaw-pair-with-controller
.delete(kiloclaw_email_log)
.where(
and(eq(kiloclaw_email_log.user_id, userId), eq(kiloclaw_email_log.email_type, emailType))
try {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the delete also fails (transient DB error), the original send error must still propagate — without this, a delete failure masks the real error and the sweep iteration crashes on the wrong exception

)
.where(
sql`(${kiloclaw_subscriptions.status} IS NULL OR ${kiloclaw_subscriptions.status} NOT IN ('active', 'trialing'))`
and(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this guard, the sweep sends "earlybird ending" emails to users who are already locked out by a prior cron run or webhook suspension.


function getSubscriptionPeriods(subscription: Stripe.Subscription) {
// In the current Stripe API, period timestamps live on the subscription item, not the subscription.
// Stripe moved period timestamps to the item level (not the top-level subscription object).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stripe subscriptions should always have at least one item, but log if not so we can investigate rather than silently returning nulls

@evanjacobson
Copy link
Contributor Author

Related: #1135

@evanjacobson
Copy link
Contributor Author

// Stripe moved period timestamps to the item level (not the top-level subscription object).
const item = subscription.items.data[0];
if (!item) {
console.warn('[stripe] Subscription has no items:', subscription.id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using logWarning here

and(eq(kiloclaw_email_log.user_id, userId), eq(kiloclaw_email_log.email_type, emailType))
);
} catch (deleteError) {
console.error(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using logError here

Copy link
Contributor

@jeanduplessis jeanduplessis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving approval specifically for billing changes. We should get @pandemicsyn's final approval for the pairing work.

@evanjacobson evanjacobson merged commit aac35a5 into main Mar 17, 2026
18 checks passed
@evanjacobson evanjacobson deleted the feature/kiloclaw-pair-with-controller branch March 17, 2026 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants