
docs: add agentcube snapstart proposal#366

Open
lyuyun wants to merge 1 commit into volcano-sh:main from lyuyun:snapstart-proposal

@lyuyun lyuyun commented May 28, 2026

What type of PR is this?

/kind documentation

What this PR does / why we need it:

This PR adds the AgentCube SnapStart feature design proposal.

SnapStart is an optional snapshot-based startup acceleration layer for AgentCube session runtimes. Phase 1 focuses on Code Interpreter Fork SnapStart: building a reusable, user-state-free runtime baseline after expensive bootstrap initialization has completed, then restoring new sessions from that baseline instead of cold-starting from the image each time.

The proposal defines the generic control-plane abstraction and the Phase 1 Code Interpreter scope:

  • SandboxSnapshot: the generic snapshot CRD, with snapshotMode=Fork for reusable startup baseline snapshots.
  • SnapshotClass: infrastructure/provider capability selection using providerName, supportedSnapshotModes, and node selection.
  • SandboxSnapshotTask: internal node-facing task CRD used by SandboxSnapshotController to dispatch snapshot creation to node agents.
  • SnapshotDriver: node-agent-local driver interface for runtime/VMM-specific snapshot creation.
  • SnapshotArtifactManifest, SnapshotArtifactSet, and SnapshotArtifact: internal artifact-store records used to track active and pending snapshot artifact sets.
  • Standard restore intent on the session Sandbox, expressed through agentcube.volcano.sh/snapshot-key.
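
As a rough illustration of how SnapshotClass capability selection could look, the sketch below models a trimmed-down class with providerName, supportedSnapshotModes, and a node selector. The Go type, its field names, the label key `example.com/snapshot-capable`, and the helper methods are all assumptions made for illustration; the proposal itself defines the actual schema.

```go
package main

import "fmt"

// SnapshotMode names how a snapshot is meant to be used.
// Only Fork is in scope for Phase 1 according to the PR description.
type SnapshotMode string

const SnapshotModeFork SnapshotMode = "Fork"

// SnapshotClass is a hypothetical, simplified view of the proposed resource.
type SnapshotClass struct {
	Name                   string
	ProviderName           string
	SupportedSnapshotModes []SnapshotMode
	NodeSelector           map[string]string // assumed form of "node selection"
}

// Supports reports whether the class advertises the given snapshot mode.
func (c SnapshotClass) Supports(mode SnapshotMode) bool {
	for _, m := range c.SupportedSnapshotModes {
		if m == mode {
			return true
		}
	}
	return false
}

// MatchesNode reports whether every selector entry is present in the node labels.
func (c SnapshotClass) MatchesNode(labels map[string]string) bool {
	for key, want := range c.NodeSelector {
		got, ok := labels[key]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	class := SnapshotClass{
		Name:                   "example-class",
		ProviderName:           "example-provider",
		SupportedSnapshotModes: []SnapshotMode{SnapshotModeFork},
		NodeSelector:           map[string]string{"example.com/snapshot-capable": "true"},
	}
	node := map[string]string{"example.com/snapshot-capable": "true"}
	fmt.Println(class.Supports(SnapshotModeFork) && class.MatchesNode(node))
}
```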

The design keeps business runtime logic outside the generic snapshot controller. Business runtime controllers normalize runtime-specific intent into standard SandboxTemplate resources, while SandboxSnapshotController only understands SandboxSnapshot, SandboxTemplate, Sandbox, nodes, and SandboxSnapshotTask.

The proposal also clarifies key Phase 1 boundaries:

  • Phase 1 implements Code Interpreter Fork SnapStart.
  • Resume mode remains an API evolution direction but is outside the SnapStart feature.
  • AgentCube manages artifact records and restore availability, but does not implement business-level physical artifact GC.
  • snapshotKey is the AgentCube-level restore reference passed to runtime/provider layers.
  • snapshotHash is a control-plane hash of snapshot inputs that affect compatibility, not a physical artifact digest.
  • Secret, ConfigMap, PVC, and projected volume references are included through pod template spec references, but their underlying data content is not dereferenced.
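
To make the snapshotHash boundary concrete, here is a minimal sketch of a control-plane hash that covers reference names but never the referenced data. The `SnapshotInputs` struct, its fields, and the choice to sort references before hashing are assumptions for illustration only; the real set of compatibility-relevant inputs is defined by the proposal.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// SnapshotInputs is a hypothetical set of inputs that affect compatibility.
// Secret and ConfigMap entries are reference names only; their data is never read.
type SnapshotInputs struct {
	Image         string   `json:"image"`
	SnapshotMode  string   `json:"snapshotMode"`
	ProviderName  string   `json:"providerName"`
	SecretRefs    []string `json:"secretRefs"`
	ConfigMapRefs []string `json:"configMapRefs"`
}

// sortedCopy returns a sorted, non-nil copy so nil and empty slices hash the same.
func sortedCopy(in []string) []string {
	out := append([]string{}, in...)
	sort.Strings(out)
	return out
}

// SnapshotHash hashes a normalized JSON encoding of the inputs.
// It is a control-plane identifier, not a digest of any physical artifact.
func SnapshotHash(in SnapshotInputs) (string, error) {
	normalized := in
	normalized.SecretRefs = sortedCopy(in.SecretRefs)
	normalized.ConfigMapRefs = sortedCopy(in.ConfigMapRefs)
	raw, err := json.Marshal(normalized)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	hash, err := SnapshotHash(SnapshotInputs{
		Image:        "example/code-interpreter:1.0",
		SnapshotMode: "Fork",
		ProviderName: "example-provider",
		SecretRefs:   []string{"api-credentials"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(hash)
}
```

Under this sketch, rotating the contents of a referenced Secret leaves the hash unchanged, while renaming the reference or changing the image produces a new hash.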

The proposal includes an engineering contract for Fork-safe snapshot points. For Code Interpreter, the reusable artifact must be captured only after bootstrap work is complete and before user code, uploaded files, session tokens, workspace contents, request-specific environment, or task context enter the runtime state.
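
One way to turn the Fork-safe contract into a testable check is sketched below. The `RuntimeState` struct and its fields are hypothetical stand-ins for whatever signals the runtime actually exposes; the function only encodes the rule stated above: bootstrap must be complete and no user or session state may be present.

```go
package main

import "fmt"

// RuntimeState is a hypothetical summary of what the runtime holds
// at the moment a snapshot is requested.
type RuntimeState struct {
	BootstrapComplete   bool
	UserCodeLoaded      bool
	UploadedFiles       int
	SessionTokenPresent bool
	WorkspaceEntries    int
	RequestEnvVars      int
	TaskContextPresent  bool
}

// ForkSafeViolations returns every reason the state is not a valid
// Fork snapshot point. An empty result means the point is Fork-safe.
func ForkSafeViolations(s RuntimeState) []string {
	var violations []string
	if !s.BootstrapComplete {
		violations = append(violations, "bootstrap work is not complete")
	}
	if s.UserCodeLoaded {
		violations = append(violations, "user code has entered the runtime")
	}
	if s.UploadedFiles > 0 {
		violations = append(violations, "uploaded files are present")
	}
	if s.SessionTokenPresent {
		violations = append(violations, "a session token is present")
	}
	if s.WorkspaceEntries > 0 {
		violations = append(violations, "workspace contents are present")
	}
	if s.RequestEnvVars > 0 {
		violations = append(violations, "request-specific environment is present")
	}
	if s.TaskContextPresent {
		violations = append(violations, "task context is present")
	}
	return violations
}

func main() {
	clean := RuntimeState{BootstrapComplete: true}
	fmt.Println(len(ForkSafeViolations(clean)) == 0)
}
```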

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This PR is design-only and does not implement SnapStart runtime behavior.

Key review areas:

  • Whether the SandboxSnapshot / SandboxSnapshotTask abstraction is sufficient.
  • Whether the Code Interpreter Fork-safe readiness contract is specific enough to guide implementation and testing.

Does this PR introduce a user-facing change?:

NONE

Copilot AI review requested due to automatic review settings May 28, 2026 13:45
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot
Contributor

Welcome @lyuyun! It looks like this is your first PR to volcano-sh/agentcube 🎉

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive design proposal for AgentCube SnapStart, aiming to significantly reduce sandbox session startup latency using Kuasar's WarmForkSnapshot mechanism. The proposal details the API design, internal controller implementation, ready-waiting protocols, and a multi-phase evolution plan. The feedback points out a missing Go type definition and constants for PerNodeSnapshotPhase in the API design section, which should be added to ensure completeness.

Comment thread docs/design/agentcube-snapstart-proposal.md Outdated
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds a new design document proposing the AgentCube SnapStart feature, which leverages Kuasar's WarmForkSnapshot mechanism to reduce sandbox session startup latency from 5–12 s to 0.5–2 s by restoring a runtime from a pre-initialized checkpoint instead of cold-starting.

Changes:

  • Introduces a dedicated SnapStart CRD (separate from CodeInterpreter/AgentRuntime) with artifact/placement/invalidation semantics and Phase 1 node-local storage backend.
  • Specifies the ready-waiting protocol (inject socket, GET /runtime/status, PREPARE/COMMIT/STARTED flow, PRNG reseed) and an agentd Kuasar Admin HTTP proxy for Workload Manager.
  • Lays out a multi-phase plan: Phase 1 Code Interpreter, Phase 2 Browser Agent (BrowserWarmFork with 9 quiescence constraints), and future distributed artifact distribution.

Comment thread docs/design/agentcube-snapstart-proposal.md Outdated
Comment thread docs/design/agentcube-snapstart-proposal.md Outdated
Comment thread docs/design/agentcube-snapstart-proposal.md Outdated
Comment thread docs/design/agentcube-snapstart-proposal.md Outdated
Comment thread docs/design/agentcube-snapstart-proposal.md Outdated
Comment thread docs/design/agentcube-snapstart-proposal.md Outdated
@codecov-commenter

codecov-commenter commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.90%. Comparing base (524e55e) to head (7622889).
⚠️ Report is 119 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #366       +/-   ##
===========================================
+ Coverage   47.57%   57.90%   +10.33%     
===========================================
  Files          30       34        +4     
  Lines        2819     3181      +362     
===========================================
+ Hits         1341     1842      +501     
+ Misses       1338     1154      -184     
- Partials      140      185       +45     
| Flag | Coverage Δ |
|---|---|
| unittests | 57.90% <ø> (+10.33%) ⬆️ |

@Abhinav-kodes
Contributor

Hi @lyuyun, I’m interested in working on this direction as well. I had already opened issue #365 with a Firecracker-based proposal, but after reading your PR I think the Kuasar SnapStart approach looks like a much better fit for AgentCube’s Kubernetes-native architecture.
I’d be happy to collaborate, help refine the design, and take on some implementation work once the direction is confirmed. If useful, I can also align my issue with this proposal so we avoid duplicate effort.

Comment thread docs/design/agentcube-snapstart-proposal.md Outdated
@lyuyun
Author

lyuyun commented May 29, 2026

> Hi @lyuyun, I’m interested in working on this direction as well. I had already opened issue #365 with a Firecracker-based proposal, but after reading your PR I think the Kuasar SnapStart approach looks like a much better fit for AgentCube’s Kubernetes-native architecture. I’d be happy to collaborate, help refine the design, and take on some implementation work once the direction is confirmed. If useful, I can also align my issue with this proposal so we avoid duplicate effort.

Hi @Abhinav-kodes, thanks for taking a look and for the thoughtful proposal in #365.
I agree it would be good to avoid duplicate effort and align the two directions. It would be great if you could help review and refine the design. Once the direction is agreed, we can split the implementation into smaller tasks and collaborate from there. Aligning #365 with this proposal sounds good to me.

| SnapStart | Runtime memory template | Snapshot storage and restore cost | Avoids runtime initialization delay |
| SnapStart WarmPool | Restored runtime instances | CoW memory and idle CPU | Avoids restore latency too |

In Phase 1, SnapStart and `spec.warmPoolSize` are mutually exclusive. A later phase can define a combined model with clear metrics and cost attribution.
Member

Phase 1 says SnapStart and spec.warmPoolSize are mutually exclusive, but the create path later lets a ready snapshot bypass SandboxClaim and fall back to SandboxWarmPool when unavailable.

Author

Thanks for pointing this out. I will revise the proposal to clarify the relationship between SandboxWarmPool and SnapStart.

They operate at different layers:

  • CodeInterpreter.spec.warmPoolSize controls AgentCube SandboxWarmPool allocation. When a SandboxClaim hits an existing ready slot, AgentCube binds an already-running Sandbox. No additional Kuasar startup or snapshot restore occurs.
  • SnapStart is a Kuasar sandbox-start optimization. It applies only when Kuasar starts a newly created Sandbox.
  • SandboxWarmPool background refill creates new Sandboxes. Those refill Sandboxes may use SnapStart during their initial startup and then enter the warm pool as ordinary ready slots.

After reviewing the overlap, I plan to remove SnapStart.spec.snapStartWarmPool from the proposal and CRD. A separate Kuasar-layer pool of pre-restored instances would duplicate SandboxWarmPool capacity management while introducing additional precedence, quota, reclamation, observability, and failure-recovery semantics. The expected value does not justify that complexity.

The simplified model is:

  • SandboxWarmPool hit: reuse an already-running Sandbox without another restore.
  • Direct Sandbox creation or SandboxWarmPool refill: Kuasar attempts SnapStart restore and falls back to cold start when no compatible template is available.
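
The simplified model above can be sketched as a small decision function. The names `StartPath` and `ChooseStartPath` and the boolean inputs are hypothetical and chosen only to illustrate the precedence; they are not taken from the proposal or from Kuasar.

```go
package main

import "fmt"

// StartPath names how a session ends up with a running Sandbox.
type StartPath string

const (
	ReuseWarmSlot    StartPath = "ReuseWarmSlot"    // SandboxWarmPool hit, no restore
	SnapStartRestore StartPath = "SnapStartRestore" // new Sandbox restored from a template
	ColdStart        StartPath = "ColdStart"        // new Sandbox started from the image
)

// ChooseStartPath applies the precedence from the simplified model.
// warmSlotAvailable is false for warm pool refills, because a refill
// always creates a new Sandbox.
func ChooseStartPath(warmSlotAvailable, compatibleTemplate bool) StartPath {
	if warmSlotAvailable {
		return ReuseWarmSlot
	}
	if compatibleTemplate {
		return SnapStartRestore
	}
	return ColdStart
}

func main() {
	fmt.Println(ChooseStartPath(true, true))
	fmt.Println(ChooseStartPath(false, true))
	fmt.Println(ChooseStartPath(false, false))
}
```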

| Build placement | SnapshotController chooses where to materialize local templates. Phase 1 may use all eligible nodes or a smaller policy such as minimum ready nodes; the API/status must not assume all nodes are always built eagerly. |
| Restore placement | Workload Manager selects a Ready placement entry and binds the Sandbox to the node that can restore it. Phase 1 selects only entries with `cacheState=LocalReady`; future distributed artifacts may allow lazy pull before restore. |
| Concurrency | SnapshotController may build or materialize placements in parallel. |
| Partial failure | If at least one node becomes Ready, the aggregate `SnapStart.status.snapshot.phase` can be `Ready` and `activeMode=Snapshot`; failed nodes are reported through `Degraded` condition and per-node Redis state. |
Member

Marking the snapshot ready when only one node is ready ignores the user's placement policy when minReadyNodes > 1.

Author

@lyuyun lyuyun May 30, 2026

Thanks for pointing this out. I will simplify the Phase 1 API and remove spec.placement, including MinimumReady and minReadyNodes.

In Phase 1, SnapshotController builds node-local templates on all eligible nodes. The aggregate snapshot becomes usable when at least one valid node-local template is ready, while missing or failed nodes are reported as degraded and replenished in the background. Operators control the eligible node set through the agentcube.volcano.sh/kuasar-snapstart=true label and runtime scheduling constraints.
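
A minimal sketch of the aggregation rule described above, assuming a per-node state with three hypothetical values (`Ready`, `Missing`, `Failed`). The type and function names are illustrative and do not correspond to the CRD status fields in the proposal.

```go
package main

import "fmt"

// NodeTemplateState is a hypothetical per-node state for a node-local template.
type NodeTemplateState string

const (
	NodeTemplateReady   NodeTemplateState = "Ready"
	NodeTemplateMissing NodeTemplateState = "Missing"
	NodeTemplateFailed  NodeTemplateState = "Failed"
)

// Aggregate summarizes the snapshot across all eligible nodes.
type Aggregate struct {
	Usable     bool // at least one node-local template is ready
	Degraded   bool // at least one eligible node is missing or failed
	ReadyNodes int
}

// Summarize folds per-node states into the aggregate view.
func Summarize(nodes map[string]NodeTemplateState) Aggregate {
	var agg Aggregate
	for _, state := range nodes {
		switch state {
		case NodeTemplateReady:
			agg.ReadyNodes++
		case NodeTemplateMissing, NodeTemplateFailed:
			agg.Degraded = true
		}
	}
	agg.Usable = agg.ReadyNodes > 0
	return agg
}

func main() {
	agg := Summarize(map[string]NodeTemplateState{
		"node-a": NodeTemplateReady,
		"node-b": NodeTemplateFailed,
	})
	fmt.Println(agg.Usable, agg.Degraded, agg.ReadyNodes)
}
```

With one ready node and one failed node, the aggregate is usable and degraded at the same time, which matches the intent that failed nodes are replenished in the background without blocking restores.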

More granular placement policies may be introduced later with distributed artifact support, when region, zone, capacity, and lazy materialization semantics are defined.

Copilot AI review requested due to automatic review settings June 1, 2026 02:16
@lyuyun lyuyun force-pushed the snapstart-proposal branch from fa0541e to 565b733 Compare June 1, 2026 02:16
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Signed-off-by: lyuyun <lyuyun068@gmail.com>
7 participants