docs: add agentcube snapstart proposal by lyuyun · Pull Request #366 · volcano-sh/agentcube

lyuyun · 2026-05-28T13:45:49Z

What type of PR is this?

/kind documentation

What this PR does / why we need it:

This PR adds the AgentCube SnapStart feature design proposal.

SnapStart is an optional snapshot-based startup acceleration layer for AgentCube session runtimes. Phase 1 focuses on Code Interpreter Fork SnapStart: building a reusable, user-state-free runtime baseline after expensive bootstrap initialization has completed, then restoring new sessions from that baseline instead of cold-starting from the image each time.

The proposal defines the generic control-plane abstraction and the Phase 1 Code Interpreter scope:

SandboxSnapshot: the generic snapshot CRD, with snapshotMode=Fork for reusable startup baseline snapshots.
SnapshotClass: infrastructure/provider capability selection using providerName, supportedSnapshotModes, and node selection.
SandboxSnapshotTask: internal node-facing task CRD used by SandboxSnapshotController to dispatch snapshot creation to node agents.
SnapshotDriver: node-agent-local driver interface for runtime/VMM-specific snapshot creation.
SnapshotArtifactManifest, SnapshotArtifactSet, and SnapshotArtifact: internal artifact-store records used to track active and pending snapshot artifact sets.
Standard restore intent on the session Sandbox through agentcube.volcano.sh/snapshot-key.
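
To make the relationship between the first two resources concrete, here is a minimal Python sketch of how a SandboxSnapshot could be validated against a SnapshotClass. It is illustrative only and is not the proposal's API: the dataclass layout, the `snapshot_class_name` and `template_name` reference fields, and the `validate` helper are assumptions; only `providerName`, `supportedSnapshotModes`, node selection, and `snapshotMode=Fork` come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SnapshotClass:
    """Infrastructure/provider capability selection (layout is hypothetical)."""
    name: str
    provider_name: str                   # providerName
    supported_snapshot_modes: List[str]  # supportedSnapshotModes
    node_selector: Dict[str, str] = field(default_factory=dict)  # node selection


@dataclass
class SandboxSnapshot:
    """Generic snapshot resource; Phase 1 uses snapshotMode=Fork."""
    name: str
    snapshot_mode: str        # snapshotMode
    snapshot_class_name: str  # hypothetical reference to a SnapshotClass
    template_name: str        # hypothetical reference to a SandboxTemplate


def validate(snapshot: SandboxSnapshot, classes: Dict[str, SnapshotClass]) -> List[str]:
    """Return a list of validation errors; an empty list means the snapshot is acceptable."""
    cls = classes.get(snapshot.snapshot_class_name)
    if cls is None:
        return [f"unknown SnapshotClass {snapshot.snapshot_class_name!r}"]
    if snapshot.snapshot_mode not in cls.supported_snapshot_modes:
        return [f"mode {snapshot.snapshot_mode!r} not supported by {cls.name!r}"]
    return []
```

With a class that lists only `Fork`, a `Fork` snapshot validates cleanly while a `Resume` snapshot is rejected, which matches the Phase 1 scope described in this PR.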

The design keeps business runtime logic outside the generic snapshot controller. Business runtime controllers normalize runtime-specific intent into standard SandboxTemplate resources, while SandboxSnapshotController only understands SandboxSnapshot, SandboxTemplate, Sandbox, nodes, and SandboxSnapshotTask.

The proposal also clarifies key Phase 1 boundaries:

Phase 1 implements Code Interpreter Fork SnapStart.
Resume mode remains an API evolution direction but is outside the SnapStart feature.
AgentCube manages artifact records and restore availability, but does not implement business-level physical artifact GC.
snapshotKey is the AgentCube-level restore reference passed to runtime/provider layers.
snapshotHash is a control-plane hash of snapshot inputs that affect compatibility, not a physical artifact digest.
Secret, ConfigMap, PVC, and projected volume references are included through pod template spec references, but their underlying data content is not dereferenced.
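
The distinction between hashing references and hashing data content can be sketched as follows. This is a hedged Python illustration, not the hashing scheme from the proposal: the chosen input fields (`image`, `snapshotMode`, and the three reference lists), the canonical JSON encoding, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
from typing import Dict, Iterable


def hash_inputs(image: str, snapshot_mode: str,
                secret_refs: Iterable[str],
                configmap_refs: Iterable[str],
                pvc_refs: Iterable[str]) -> Dict[str, object]:
    """Collect hypothetical compatibility inputs.

    Only reference *names* are recorded; the data behind a Secret,
    ConfigMap, or PVC is never read.
    """
    return {
        "image": image,
        "snapshotMode": snapshot_mode,
        "secretRefs": sorted(secret_refs),
        "configMapRefs": sorted(configmap_refs),
        "pvcRefs": sorted(pvc_refs),
    }


def snapshot_hash(inputs: Dict[str, object]) -> str:
    """Hash a canonical JSON encoding so key and list order do not matter."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this sketch, rotating the content of a referenced Secret leaves the hash unchanged, while pointing the template at a differently named Secret produces a new hash. The value is a control-plane identifier for the inputs, not a digest of any artifact on disk.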

The proposal includes an engineering contract for Fork-safe snapshot points. For Code Interpreter, the reusable artifact must be captured only after bootstrap work is complete and before user code, uploaded files, session tokens, workspace contents, request-specific environment, or task context enter the runtime state.
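
The readiness contract above can be expressed as a simple predicate, which may also be a useful shape for tests. The following Python sketch is an assumption about how such a check might look: `RuntimeState`, the state-kind names, and `is_fork_safe` are hypothetical; only the list of excluded state categories is taken from the contract.

```python
from dataclasses import dataclass, field
from typing import Set

# State categories that must NOT be present when the baseline is captured.
USER_STATE_KINDS = frozenset({
    "user_code",
    "uploaded_files",
    "session_tokens",
    "workspace_contents",
    "request_environment",
    "task_context",
})


@dataclass
class RuntimeState:
    """Hypothetical view of a Code Interpreter runtime at a candidate snapshot point."""
    bootstrap_complete: bool
    present_state: Set[str] = field(default_factory=set)


def is_fork_safe(state: RuntimeState) -> bool:
    """True only after bootstrap has finished and before any user state has entered."""
    if not state.bootstrap_complete:
        return False
    return not (state.present_state & USER_STATE_KINDS)
```

A runtime that has finished bootstrap and holds no user state passes the check; one that is still bootstrapping, or that has already received a session token, does not.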

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This PR is design-only and does not implement SnapStart runtime behavior.

Key review areas:

Whether the SandboxSnapshot / SandboxSnapshotTask abstraction is sufficient.
Whether the Code Interpreter Fork-safe readiness contract is specific enough to guide implementation and testing.

Does this PR introduce a user-facing change?:

NONE

volcano-sh-bot · 2026-05-28T13:45:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

Needs approval from an approver in each of these files:

docs/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

volcano-sh-bot · 2026-05-28T13:46:01Z

Welcome @lyuyun! It looks like this is your first PR to volcano-sh/agentcube 🎉

gemini-code-assist

Code Review

This pull request introduces a comprehensive design proposal for AgentCube SnapStart, aiming to significantly reduce sandbox session startup latency using Kuasar's WarmForkSnapshot mechanism. The proposal details the API design, internal controller implementation, ready-waiting protocols, and a multi-phase evolution plan. The feedback points out a missing Go type definition and constants for PerNodeSnapshotPhase in the API design section, which should be added to ensure completeness.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds a new design document proposing the AgentCube SnapStart feature, which leverages Kuasar's WarmForkSnapshot mechanism to reduce sandbox session startup latency from 5–12 s to 0.5–2 s by restoring a runtime from a pre-initialized checkpoint instead of cold-starting.

Changes:

Introduces a dedicated SnapStart CRD (separate from CodeInterpreter/AgentRuntime) with artifact/placement/invalidation semantics and Phase 1 node-local storage backend.
Specifies the ready-waiting protocol (inject socket, GET /runtime/status, PREPARE/COMMIT/STARTED flow, PRNG reseed) and an agentd Kuasar Admin HTTP proxy for Workload Manager.
Lays out a multi-phase plan: Phase 1 Code Interpreter, Phase 2 Browser Agent (BrowserWarmFork with 9 quiescence constraints), and future distributed artifact distribution.

codecov-commenter · 2026-05-28T13:51:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.90%. Comparing base (524e55e) to head (7622889).
⚠️ Report is 119 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #366       +/-   ##
===========================================
+ Coverage   47.57%   57.90%   +10.33%     
===========================================
  Files          30       34        +4     
  Lines        2819     3181      +362     
===========================================
+ Hits         1341     1842      +501     
+ Misses       1338     1154      -184     
- Partials      140      185       +45

Flag	Coverage Δ
unittests	`57.90% <ø> (+10.33%)`	⬆️

Abhinav-kodes · 2026-05-28T14:39:01Z

Hi @lyuyun, I’m interested in working on this direction as well. I had already opened issue #365 with a Firecracker-based proposal, but after reading your PR I think the Kuasar SnapStart approach looks like a much better fit for AgentCube’s Kubernetes-native architecture.
I’d be happy to collaborate, help refine the design, and take on some implementation work once the direction is confirmed. If useful, I can also align my issue with this proposal so we avoid duplicate effort.

lyuyun · 2026-05-29T10:09:48Z

Hi @Abhinav-kodes, thanks for taking a look and for the thoughtful proposal in #365.
I agree it would be good to avoid duplicate effort and align the two directions. It would be great if you could help review and refine the design. Once the direction is agreed, we can split the implementation into smaller tasks and collaborate from there. Aligning #365 with this proposal sounds good to me.

acsoto · 2026-05-30T06:49:29Z

+| SnapStart | Runtime memory template | Snapshot storage and restore cost | Avoids runtime initialization delay |
+| SnapStart WarmPool | Restored runtime instances | CoW memory and idle CPU | Avoids restore latency too |
+
+In Phase 1, SnapStart and `spec.warmPoolSize` are mutually exclusive. A later phase can define a combined model with clear metrics and cost attribution.


Phase 1 says SnapStart and spec.warmPoolSize are mutually exclusive, but the create path later lets a ready snapshot bypass SandboxClaim and fall back to SandboxWarmPool when unavailable.

Thanks for pointing this out. I will revise the proposal to clarify the relationship between SandboxWarmPool and SnapStart.

They operate at different layers:

CodeInterpreter.spec.warmPoolSize controls AgentCube SandboxWarmPool allocation. When a SandboxClaim hits an existing ready slot, AgentCube binds an already-running Sandbox. No additional Kuasar startup or snapshot restore occurs.

SnapStart is a Kuasar sandbox-start optimization. It applies only when Kuasar starts a newly created Sandbox.

SandboxWarmPool background refill creates new Sandboxes. Those refill Sandboxes may use SnapStart during their initial startup and then enter the warm pool as ordinary ready slots.

After reviewing the overlap, I plan to remove SnapStart.spec.snapStartWarmPool from the proposal and CRD. A separate Kuasar-layer pool of pre-restored instances would duplicate SandboxWarmPool capacity management while introducing additional precedence, quota, reclamation, observability, and failure-recovery semantics. The expected value does not justify that complexity.

The simplified model is:

SandboxWarmPool hit: reuse an already-running Sandbox without another restore.

Direct Sandbox creation or SandboxWarmPool refill: Kuasar attempts SnapStart restore and falls back to cold start when no compatible template is available.
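
The simplified model can be summarized as a small decision function. This Python sketch is illustrative only; `StartPath`, `choose_start_path`, and the two boolean inputs are hypothetical names, and the real decision is split between AgentCube (warm pool binding) and Kuasar (restore or cold start) rather than made in one place.

```python
from enum import Enum


class StartPath(Enum):
    WARM_POOL_REUSE = "warm-pool-reuse"      # bind an already-running Sandbox
    SNAPSTART_RESTORE = "snapstart-restore"  # new Sandbox restored from a template
    COLD_START = "cold-start"                # new Sandbox started from the image


def choose_start_path(warm_slot_available: bool,
                      compatible_template_available: bool) -> StartPath:
    """Pick the startup path for one Sandbox request.

    A SandboxWarmPool refill always creates a new Sandbox, so callers on the
    refill path pass warm_slot_available=False.
    """
    if warm_slot_available:
        return StartPath.WARM_POOL_REUSE
    if compatible_template_available:
        return StartPath.SNAPSTART_RESTORE
    return StartPath.COLD_START
```

Note that a warm pool hit takes priority even when a compatible template exists, because reusing a running Sandbox involves no restore at all.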

acsoto · 2026-05-30T06:49:38Z

+| Build placement | SnapshotController chooses where to materialize local templates. Phase 1 may use all eligible nodes or a smaller policy such as minimum ready nodes; the API/status must not assume all nodes are always built eagerly. |
+| Restore placement | Workload Manager selects a Ready placement entry and binds the Sandbox to the node that can restore it. Phase 1 selects only entries with `cacheState=LocalReady`; future distributed artifacts may allow lazy pull before restore. |
+| Concurrency | SnapshotController may build or materialize placements in parallel. |
+| Partial failure | If at least one node becomes Ready, the aggregate `SnapStart.status.snapshot.phase` can be `Ready` and `activeMode=Snapshot`; failed nodes are reported through `Degraded` condition and per-node Redis state. |


Marking the snapshot ready when only one node is ready ignores the user's placement policy when minReadyNodes > 1.

Thanks for pointing this out. I will simplify the Phase 1 API and remove spec.placement, including MinimumReady and minReadyNodes.

In Phase 1, SnapshotController builds node-local templates on all eligible nodes. The aggregate snapshot becomes usable when at least one valid node-local template is ready, while missing or failed nodes are reported as degraded and replenished in the background. Operators control the eligible node set through the agentcube.volcano.sh/kuasar-snapstart=true label and runtime scheduling constraints.
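
As a rough illustration of the Phase 1 rule just described, the aggregate state could be derived as below. This Python sketch is an assumption, not the controller's implementation: the per-node state string `"Ready"`, the function names, and the returned dictionary keys are hypothetical; only the node label and the "at least one ready node" rule come from the text above.

```python
from typing import Dict, List

SNAPSTART_LABEL = "agentcube.volcano.sh/kuasar-snapstart"


def eligible_nodes(node_labels: Dict[str, Dict[str, str]]) -> List[str]:
    """Nodes the operator has opted in through the SnapStart label."""
    return sorted(name for name, labels in node_labels.items()
                  if labels.get(SNAPSTART_LABEL) == "true")


def aggregate_status(eligible: List[str],
                     template_states: Dict[str, str]) -> Dict[str, object]:
    """Usable when at least one eligible node has a ready node-local template.

    Nodes with a missing or non-ready template are listed as degraded so
    they can be replenished in the background.
    """
    ready = [n for n in eligible if template_states.get(n) == "Ready"]
    degraded = [n for n in eligible if template_states.get(n) != "Ready"]
    return {"usable": bool(ready), "readyNodes": ready, "degradedNodes": degraded}
```

For example, with three labeled nodes where one is ready, one has failed, and one has no template yet, the snapshot is usable and the other two nodes are reported as degraded.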

More granular placement policies may be introduced later with distributed artifact support, when region, zone, capacity, and lazy materialization semantics are defined.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Signed-off-by: lyuyun <lyuyun068@gmail.com>

Copilot AI review requested due to automatic review settings May 28, 2026 13:45

volcano-sh-bot requested review from acsoto and hzxuzhonghu May 28, 2026 13:45

volcano-sh-bot added the size/XXL label May 28, 2026

gemini-code-assist Bot reviewed May 28, 2026

Comment thread docs/design/agentcube-snapstart-proposal.md Outdated

Copilot AI reviewed May 28, 2026

RainbowMango reviewed May 29, 2026

Comment thread docs/design/agentcube-snapstart-proposal.md Outdated

acsoto reviewed May 30, 2026

acsoto mentioned this pull request May 30, 2026

[Benchmarks] AgentCube SnapStart Validation for Agentic RL Rollouts #365

Open

Copilot AI review requested due to automatic review settings June 1, 2026 02:16

lyuyun force-pushed the snapstart-proposal branch from fa0541e to 565b733 Compare June 1, 2026 02:16

Copilot AI reviewed Jun 1, 2026

docs: add agentcube snapstart proposal

7622889

Signed-off-by: lyuyun <lyuyun068@gmail.com>

lyuyun force-pushed the snapstart-proposal branch from 565b733 to 7622889 Compare June 4, 2026 14:57

lyuyun mentioned this pull request Jun 5, 2026

feat: implement snapstart for codeinterpreter #379

Open
