feat(controller): pair with controller for channel/device pairing#1095
feat(controller): pair with controller for channel/device pairing#1095evanjacobson merged 44 commits intomainfrom
Conversation
…iented stdout monitoring
…and initial fetch
…ng, reduce type assertions
…ntroller startup/shutdown
- Add stopped guards to refresh methods preventing post-cleanup work - Replace unsafe type casts with isRecord() helper in route handlers - Wrap KV delete calls in try-catch to prevent cache invalidation failures from masking successful approvals - Fix cross-reference comments and JSDoc accuracy - Add tests covering empty-field filtering, post-cleanup behavior, detectChannels edge cases, start idempotency, post-approve refresh failures, and DO-level approve/refresh fallbacks
When a DO has userId/sandboxId but lost its flyMachineId (e.g. after a DO migration or state reset), the start() flow would try to create a new machine and hit a 409 "already_exists" from Fly. Two fixes: 1. Run attemptMetadataRecovery in start() when flyMachineId is missing, not just during full Postgres restore. 2. Safety net: if createNewMachine gets a 409 "already_exists", extract the machine ID from the error body and adopt it via startExistingMachine.
The volume verification was guarded by !flyMachineId, so it was skipped after metadata recovery set the machine ID. A stale flyVolumeId would then be passed to updateMachine, causing a 400 "volume does not exist" error.
Code Review SummaryStatus: 2 Issues Found | Recommendation: Address before merge Overview
Fix these issues in Kilo Cloud Issue Details (click to expand)WARNING
Other Observations (not in diff)Issues found in unchanged code that cannot receive inline comments: N/A Files Reviewed (17 files)
Reviewed by gpt-5.4-20260305 · 2,133,947 tokens |
Resolved conflicts: - controller/src/index.ts: combined pairing + gmail-watch/env features - scripts/.dev-start.conf.example: took main's updated format + new env var - src/routes/platform.ts: added Gmail safe error prefix from main - Fixed supervisor.test.ts: gatewayArgs → args to match renamed option
…w-pair-with-controller
…emoved refreshChannelPairing() returned early when detectChannels() yielded no channels, leaving the previous channelCache intact. Users who removed their last channel token would still see old pending pairing requests until the controller restarted.
…delay New controllers returned empty pairing queues for the first 5 seconds because start() scheduled the initial refreshAll behind a setTimeout.
…hine-recovery) The 409 already_exists recovery in createNewMachine has been extracted to a dedicated branch with proper logging and test coverage.
… logging - Replace manual PairingRequest/DevicePairingRequest types with z.infer from the shared Zod schemas, reducing type duplication - Replace generic parseCachedRequests with schema-validated parsers - Extract regex constants (CHANNEL_RE, CODE_RE, UUID_RE) for reuse - Include exit_code, stderr/stdout, and parse errors in log messages
- Remove try/catch around post-approve refreshes (let errors propagate) - Remove .catch wrappers around refreshAll calls (Promise.allSettled handles errors internally per-refresh) - Add WARNING log prefix when a previously-successful channel fails - Log when all channels fail and cache is not updated - Export isRecord helper to avoid duplication in routes - Remove unnecessary `as const` assertions in factory functions
…ponses
- Remove try/catch around refresh calls — errors now surface as 500s
instead of silently returning stale cached data
- Remove duplicate isRecord (use export from pairing-cache)
- Remove redundant `error` field from error responses (message suffices)
- Simplify approveResponse to always use { success, message }
Prevent approveChannel/approveDevice from shelling out to the CLI after cleanup() has been called.
- Log controller version check failures instead of empty catch block - Log non-404 getVolume errors that were silently dropped
Reverts the awaited start() from 7b3ca65. That commit addressed the "empty cache on fresh startup" review comment, but it introduced a worse problem: each CLI path has a 45s timeout, so a slow or wedged OpenClaw CLI can hold server.listen() long enough for the DO's 60s health probe to fail — turning a non-critical cache miss into a full machine startup failure. The empty-cache window is acceptable because: - The DO-side fallback chain (controller → KV → fly exec) handles it - The cache self-heals within seconds via periodic refresh and log-triggered debounce - Callers can pass ?refresh=true for on-demand population
…lo-Org/cloud into feature/kiloclaw-pair-with-controller
…st-path When attemptMetadataRecovery() finds a started machine and sets status to 'running', the start() fast-path verified the machine was alive and returned early — but skipped scheduleAlarm(). This left recovered instances without health checks or self-healing until a later state transition happened to reschedule one.
- Signal handlers now catch onSignal() rejections and force exit(1) instead of silently hanging the process - trySendEmail: wrap DB delete in nested try/catch so a delete failure no longer shadows the original send error - Earlybird warning query: exclude suspended users via isNull(suspended_at) - reconcileApiKeyExpiry: use finally to clear mintTimeoutId, avoiding use-before-assignment and ensuring timeout is always cancelled - reconcileMachineMount: move flyMachineId null guard before the log call that references it - getSubscriptionPeriods: log a warning when items.data is empty - getGatewayToken: log warning instead of swallowing error silently - pairing.ts: invert empty-if fall-through pattern to if (!cond) throw - pairing.ts: add sandboxId/appId context to channel list error logs - pairing-cache.ts: remove always-true else-if condition and redundant comments
…t coverage - Fix reconcileStarting stuck-in-starting when transient Fly API errors persist past STARTING_TIMEOUT_MS — now transitions to 'stopped' - Preserve stale pairing-cache data for channels whose CLI refresh fails instead of silently dropping their requests - Add console.warn breadcrumbs in DO pairing catch blocks before re-throwing non-route controller errors - Simplify error extraction in callGatewayController (hoist bodyObj, replace nested ternary) - Add 13 new tests covering: timeout-vs-transient reconciliation, stale-cache preservation, approve-after-cleanup guards, KV cache intermediate fallback, throwing onStdoutLine resilience, and non-running pairing guards
- Add generation counters to pairing cache refreshes to prevent stale overwrites from concurrent refresh calls (timer/debounce/request) - Guard start() against zombie resurrection after destroy()+deleteAll() by also aborting on undefined storage status - Add startInProgress flag to prevent concurrent start() calls from creating duplicate Fly machines via waitUntil + RPC interleaving
…lo-Org/cloud into feature/kiloclaw-pair-with-controller
| .delete(kiloclaw_email_log) | ||
| .where( | ||
| and(eq(kiloclaw_email_log.user_id, userId), eq(kiloclaw_email_log.email_type, emailType)) | ||
| try { |
There was a problem hiding this comment.
if the delete also fails (transient DB error), the original send error must still propagate — without this, a delete failure masks the real error and the sweep iteration crashes on the wrong exception
| ) | ||
| .where( | ||
| sql`(${kiloclaw_subscriptions.status} IS NULL OR ${kiloclaw_subscriptions.status} NOT IN ('active', 'trialing'))` | ||
| and( |
There was a problem hiding this comment.
without this guard, the sweep sends "earlybird ending" emails to users who are already locked out by a prior cron run or webhook suspension.
|
|
||
| function getSubscriptionPeriods(subscription: Stripe.Subscription) { | ||
| // In the current Stripe API, period timestamps live on the subscription item, not the subscription. | ||
| // Stripe moved period timestamps to the item level (not the top-level subscription object). |
There was a problem hiding this comment.
Stripe subscriptions should always have at least one item, but log if not so we can investigate rather than silently returning nulls
|
Related: #1135 |
| // Stripe moved period timestamps to the item level (not the top-level subscription object). | ||
| const item = subscription.items.data[0]; | ||
| if (!item) { | ||
| console.warn('[stripe] Subscription has no items:', subscription.id); |
There was a problem hiding this comment.
Suggest using logWarning here
| and(eq(kiloclaw_email_log.user_id, userId), eq(kiloclaw_email_log.email_type, emailType)) | ||
| ); | ||
| } catch (deleteError) { | ||
| console.error( |
There was a problem hiding this comment.
Suggest using logError here
jeanduplessis
left a comment
There was a problem hiding this comment.
Giving approval specifically for billing changes. We should get @pandemicsyn's final approval for the pairing work.
Summary
Adds a controller-side pairing cache and HTTP API so the Durable Object can resolve pairing requests via the controller's in-memory cache instead of shelling out via
fly execevery time.kilo pairing list/kilo device-pairing list, parses stdout, and serves cached results. Refreshes reactively when the gateway logs a new pairing event (via a newonStdoutLinesupervisor hook)./_kilo/pairing/{channels,devices}and/_kilo/pairing/{channels,devices}/approve.fly execon unknown-route errors, preserving backward compatibility with older controller versions.409 machine recovery: iffix(kiloclaw): recover from 409 already_exists in createNewMachine #1135createMachinehits a 409 "already_exists", adopts the existing machine instead of crashing.Also included: billing infra hardening
These files are coupled to call sites already on
mainand cannot be split out without breaking the build:stripe-handlers.ts— Stripe webhook handlers (subscription.created,.updated,.deleted, schedule events). Minor hardening: null-check on subscription items, improved comment accuracy. Known gap:subscription.updatedhandler readssuspended_atoutside a transaction (TOCTOU with billing cron — tracked separately).billing-lifecycle-cron.ts— Trial/earlybird/subscription expiry sweeps and email notifications. Hardening: addedisNull(suspended_at)guard on earlybird sweep, wrapped email-log delete in try/catch. Known gap: sweep UPDATEs use bareWHERE user_id = ?without conditional status guards (TOCTOU with webhooks — tracked separately).kiloclaw-router.ts— Billing endpoints (getBillingStatus,createSubscriptionCheckout,cancelSubscription,switchPlan, etc.) andbaseProcedure→clawAccessProceduregating on existing endpoints.modelSupport.ts— Minor refactor collapsing early-return conditions.Verification
pairing-cache.test.tsroutes/pairing.test.tsonStdoutLinetests (135 lines) —supervisor.test.tsVisual Changes
N/A
Reviewer Notes
isErrorUnknownRouteto detect older controllers that don't have the pairing routes, so rollout is safe without version gating.start()kicks off the initial refresh but does not blockserver.listen(). This avoids a slow or wedged OpenClaw CLI (45s timeout per path) holding the health endpoint unreachable past the DO's 60s startup probe. The brief empty-cache window is acceptable: the DO falls back to KV / fly exec, and the cache self-heals via periodic refresh and log-triggered debounce.cloud_todos.md) that are pre-existing and tracked for separate follow-up — not introduced or worsened by this PR.