docs: add agentcube snapstart proposal #366
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Welcome @lyuyun! It looks like this is your first PR to volcano-sh/agentcube 🎉
Code Review
This pull request introduces a comprehensive design proposal for AgentCube SnapStart, aiming to significantly reduce sandbox session startup latency using Kuasar's WarmForkSnapshot mechanism. The proposal details the API design, internal controller implementation, ready-waiting protocols, and a multi-phase evolution plan. The feedback points out a missing Go type definition and constants for PerNodeSnapshotPhase in the API design section, which should be added to ensure completeness.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds a new design document proposing the AgentCube SnapStart feature, which leverages Kuasar's WarmForkSnapshot mechanism to reduce sandbox session startup latency from 5–12 s to 0.5–2 s by restoring a runtime from a pre-initialized checkpoint instead of cold-starting.
Changes:
- Introduces a dedicated `SnapStart` CRD (separate from `CodeInterpreter`/`AgentRuntime`) with artifact/placement/invalidation semantics and a Phase 1 node-local storage backend.
- Specifies the ready-waiting protocol (inject socket, `GET /runtime/status`, PREPARE/COMMIT/STARTED flow, PRNG reseed) and an `agentd` Kuasar Admin HTTP proxy for Workload Manager.
- Lays out a multi-phase plan: Phase 1 Code Interpreter, Phase 2 Browser Agent (BrowserWarmFork with 9 quiescence constraints), and future distributed artifact distribution.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #366 +/- ##
===========================================
+ Coverage 47.57% 57.90% +10.33%
===========================================
Files 30 34 +4
Lines 2819 3181 +362
===========================================
+ Hits 1341 1842 +501
+ Misses 1338 1154 -184
- Partials 140 185 +45
Flags with carried forward coverage won't be shown.
Hi @lyuyun, I’m interested in working on this direction as well. I had already opened issue #365 with a Firecracker-based proposal, but after reading your PR I think the Kuasar SnapStart approach looks like a much better fit for AgentCube’s Kubernetes-native architecture.
Hi @Abhinav-kodes, thanks for taking a look and for the thoughtful proposal in #365.
| SnapStart | Runtime memory template | Snapshot storage and restore cost | Avoids runtime initialization delay |
| SnapStart WarmPool | Restored runtime instances | CoW memory and idle CPU | Avoids restore latency too |

In Phase 1, SnapStart and `spec.warmPoolSize` are mutually exclusive. A later phase can define a combined model with clear metrics and cost attribution.
Phase 1 says SnapStart and `spec.warmPoolSize` are mutually exclusive, but the create path later lets a ready snapshot bypass `SandboxClaim` and falls back to `SandboxWarmPool` when the snapshot is unavailable.
Thanks for pointing this out. I will revise the proposal to clarify the relationship between SandboxWarmPool and SnapStart.
They operate at different layers:
- `CodeInterpreter.spec.warmPoolSize` controls AgentCube `SandboxWarmPool` allocation. When a `SandboxClaim` hits an existing ready slot, AgentCube binds an already-running Sandbox. No additional Kuasar startup or snapshot restore occurs.
- `SnapStart` is a Kuasar sandbox-start optimization. It applies only when Kuasar starts a newly created Sandbox.
- `SandboxWarmPool` background refill creates new Sandboxes. Those refill Sandboxes may use `SnapStart` during their initial startup and then enter the warm pool as ordinary ready slots.
After reviewing the overlap, I plan to remove SnapStart.spec.snapStartWarmPool from the proposal and CRD. A separate Kuasar-layer pool of pre-restored instances would duplicate SandboxWarmPool capacity management while introducing additional precedence, quota, reclamation, observability, and failure-recovery semantics. The expected value does not justify that complexity.
The simplified model is:
- `SandboxWarmPool` hit: reuse an already-running Sandbox without another restore.
- Direct Sandbox creation or `SandboxWarmPool` refill: Kuasar attempts `SnapStart` restore and falls back to cold start when no compatible template is available.
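The simplified model above can be sketched as a small decision function. This is only an illustration under stated assumptions: `Starter`, `StartPath`, `ErrNoTemplate`, and the three callback fields are hypothetical names, not types from AgentCube or Kuasar, and the choice to surface restore errors other than "no compatible template" is an assumption the proposal does not spell out.

```go
package main

import (
	"errors"
	"fmt"
)

// StartPath records how a session obtained its Sandbox (hypothetical type).
type StartPath string

const (
	PathWarmPoolHit StartPath = "WarmPoolHit"
	PathSnapStart   StartPath = "SnapStartRestore"
	PathColdStart   StartPath = "ColdStart"
)

// ErrNoTemplate is a hypothetical sentinel for "no compatible template available".
var ErrNoTemplate = errors.New("no compatible snapshot template")

// Starter bundles the three operations as plain functions so the sketch
// stays independent of any real AgentCube or Kuasar client.
type Starter struct {
	ClaimWarmSlot func() bool  // true when a SandboxClaim hits a ready slot
	Restore       func() error // SnapStart restore attempt
	ColdStart     func() error // regular start from the image
}

// Start applies the order: warm pool hit, then restore, then cold start.
func (s Starter) Start() (StartPath, error) {
	if s.ClaimWarmSlot() {
		// Already-running Sandbox: no Kuasar startup and no restore.
		return PathWarmPoolHit, nil
	}
	err := s.Restore()
	if err == nil {
		return PathSnapStart, nil
	}
	if !errors.Is(err, ErrNoTemplate) {
		// Assumption: other restore errors are surfaced, not masked.
		return "", err
	}
	if err := s.ColdStart(); err != nil {
		return "", err
	}
	return PathColdStart, nil
}

func main() {
	s := Starter{
		ClaimWarmSlot: func() bool { return false },
		Restore:       func() error { return ErrNoTemplate },
		ColdStart:     func() error { return nil },
	}
	path, err := s.Start()
	fmt.Println(path, err)
}
```

A warm pool refill would follow the same restore-then-cold-start branch, with `ClaimWarmSlot` always returning false because refill creates a new Sandbox rather than claiming one.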
| Build placement | SnapshotController chooses where to materialize local templates. Phase 1 may use all eligible nodes or a smaller policy such as minimum ready nodes; the API/status must not assume all nodes are always built eagerly. |
| Restore placement | Workload Manager selects a Ready placement entry and binds the Sandbox to the node that can restore it. Phase 1 selects only entries with `cacheState=LocalReady`; future distributed artifacts may allow lazy pull before restore. |
| Concurrency | SnapshotController may build or materialize placements in parallel. |
| Partial failure | If at least one node becomes Ready, the aggregate `SnapStart.status.snapshot.phase` can be `Ready` and `activeMode=Snapshot`; failed nodes are reported through `Degraded` condition and per-node Redis state. |
Marking the snapshot ready when only one node is ready ignores the user's placement policy when minReadyNodes > 1.
Thanks for pointing this out. I will simplify the Phase 1 API and remove spec.placement, including MinimumReady and minReadyNodes.
In Phase 1, SnapshotController builds node-local templates on all eligible nodes. The aggregate snapshot becomes usable when at least one valid node-local template is ready, while missing or failed nodes are reported as degraded and replenished in the background. Operators control the eligible node set through the agentcube.volcano.sh/kuasar-snapstart=true label and runtime scheduling constraints.
More granular placement policies may be introduced later with distributed artifact support, when region, zone, capacity, and lazy materialization semantics are defined.
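The Phase 1 aggregation rule described in this reply (usable once at least one node-local template is ready, with missing or failed nodes reported as degraded) could look roughly like the sketch below. `NodePhase`, its values, `Aggregate`, and `Summarize` are hypothetical names chosen for illustration; the proposal's actual per-node phase type may differ. Treating a node that is still building as neither ready nor degraded is also an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// NodePhase is a hypothetical per-node template phase.
type NodePhase string

const (
	NodeReady    NodePhase = "Ready"
	NodeBuilding NodePhase = "Building"
	NodeFailed   NodePhase = "Failed"
)

// Aggregate is a hypothetical summary of the node-local templates.
type Aggregate struct {
	Usable        bool     // at least one valid node-local template is ready
	Degraded      bool     // at least one eligible node is missing or failed
	ReadyNodes    []string // nodes that can restore right now
	DegradedNodes []string // nodes to replenish in the background
}

// Summarize folds per-node phases into the aggregate view. Eligible nodes
// with no recorded phase count as missing.
func Summarize(eligible []string, phases map[string]NodePhase) Aggregate {
	var agg Aggregate
	for _, node := range eligible {
		phase, known := phases[node]
		switch {
		case known && phase == NodeReady:
			agg.ReadyNodes = append(agg.ReadyNodes, node)
		case !known || phase == NodeFailed:
			agg.DegradedNodes = append(agg.DegradedNodes, node)
		}
		// Assumption: a node that is still building is neither ready nor degraded.
	}
	sort.Strings(agg.ReadyNodes)
	sort.Strings(agg.DegradedNodes)
	agg.Usable = len(agg.ReadyNodes) > 0
	agg.Degraded = len(agg.DegradedNodes) > 0
	return agg
}

func main() {
	eligible := []string{"node-a", "node-b", "node-c"}
	phases := map[string]NodePhase{"node-a": NodeReady, "node-b": NodeFailed}
	agg := Summarize(eligible, phases)
	fmt.Printf("usable=%v degraded=%v ready=%v replenish=%v\n",
		agg.Usable, agg.Degraded, agg.ReadyNodes, agg.DegradedNodes)
}
```

In this sketch `eligible` stands for the nodes carrying the `agentcube.volcano.sh/kuasar-snapstart=true` label; how that set is actually listed is outside the example.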
fa0541e to 565b733
Signed-off-by: lyuyun <lyuyun068@gmail.com>
565b733 to 7622889
What type of PR is this?
/kind documentation
What this PR does / why we need it:
This PR adds the AgentCube SnapStart feature design proposal.
SnapStart is an optional snapshot-based startup acceleration layer for AgentCube session runtimes. Phase 1 focuses on Code Interpreter Fork SnapStart: building a reusable, user-state-free runtime baseline after expensive bootstrap initialization has completed, then restoring new sessions from that baseline instead of cold-starting from the image each time.
The proposal defines the generic control-plane abstraction and the Phase 1 Code Interpreter scope:
- `SandboxSnapshot`: the generic snapshot CRD, with `snapshotMode=Fork` for reusable startup baseline snapshots.
- `SnapshotClass`: infrastructure/provider capability selection using `providerName`, `supportedSnapshotModes`, and node selection.
- `SandboxSnapshotTask`: internal node-facing task CRD used by `SandboxSnapshotController` to dispatch snapshot creation to node agents.
- `SnapshotDriver`: node-agent-local driver interface for runtime/VMM-specific snapshot creation.
- `SnapshotArtifactManifest`, `SnapshotArtifactSet`, and `SnapshotArtifact`: internal artifact-store records used to track active and pending snapshot artifact sets.
- `Sandbox` through `agentcube.volcano.sh/snapshot-key`.
The design keeps business runtime logic outside the generic snapshot controller. Business runtime controllers normalize runtime-specific intent into standard `SandboxTemplate` resources, while `SandboxSnapshotController` only understands `SandboxSnapshot`, `SandboxTemplate`, `Sandbox`, nodes, and `SandboxSnapshotTask`.
The proposal also clarifies key Phase 1 boundaries:
- `snapshotKey` is the AgentCube-level restore reference passed to runtime/provider layers.
- `snapshotHash` is a control-plane hash of snapshot inputs that affect compatibility, not a physical artifact digest.
The proposal includes an engineering contract for Fork-safe snapshot points. For Code Interpreter, the reusable artifact must be captured only after bootstrap work is complete and before user code, uploaded files, session tokens, workspace contents, request-specific environment, or task context enter the runtime state.
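To make the `snapshotHash` boundary concrete, here is a minimal sketch of a control-plane hash computed over compatibility-affecting inputs rather than over artifact bytes. The function name `SnapshotHash`, the input keys (`image`, `runtimeClass`), and the length-prefixed encoding are all assumptions for illustration; the proposal does not list the exact inputs or the hash construction.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// SnapshotHash hashes declared inputs (hypothetical key/value pairs), not the
// snapshot artifact itself. Keys are sorted so map iteration order does not
// change the result, and lengths are prefixed so field boundaries are unambiguous.
func SnapshotHash(inputs map[string]string) string {
	keys := make([]string, 0, len(inputs))
	for k := range inputs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		v := inputs[k]
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(v), v)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Hypothetical inputs: changing either one should invalidate the snapshot.
	a := SnapshotHash(map[string]string{"image": "example/ci:1.0", "runtimeClass": "kuasar-vmm"})
	b := SnapshotHash(map[string]string{"image": "example/ci:1.1", "runtimeClass": "kuasar-vmm"})
	fmt.Println("hash changed after image bump:", a != b)
}
```

Because only declared inputs are hashed, two physically different artifacts built from the same inputs share one hash in this sketch, which matches the stated distinction from a physical artifact digest.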
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This PR is design-only and does not implement SnapStart runtime behavior.
Key review areas:
- `SandboxSnapshot`/`SandboxSnapshotTask` abstraction is sufficient.
Does this PR introduce a user-facing change?: