feat: implement snapstart for codeinterpreter by lyuyun · Pull Request #379 · volcano-sh/agentcube

lyuyun · 2026-06-05T09:12:05Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements the AgentCube SnapStart feature for CodeInterpreter.

SnapStart provides a snapshot-based startup acceleration path for CodeInterpreter sessions. It builds reusable runtime snapshot artifacts and lets new sandboxes restore from an active snapshotKey instead of always cold-starting from the image.

This PR implements the control-plane and node-agent pieces for SnapStart:

SandboxSnapshot: the generic snapshot CRD, supporting snapshotMode=Fork and snapshotMode=Resume.
SnapshotClass: infrastructure/provider capability selection using providerName, supportedSnapshotModes, and node selection.
SandboxSnapshotTask: internal node-facing task CRD used by the snapshot controller to dispatch snapshot creation to node agents.
SandboxSnapshotController: workload-manager controller that manages snapshot lifecycle, target node selection, artifact task creation, retry handling, active/pending artifact promotion, and cleanup.
SnapshotDriver: node-agent-local driver abstraction for provider-specific snapshot creation.
Kuasar snapshot driver wiring in agentd.
Redis-backed SnapshotArtifactManifest storage for active and pending snapshot artifact sets.
CodeInterpreter restore integration through snapshot metadata passed into session Sandbox creation.

The implementation keeps snapshot orchestration generic. Runtime-specific controllers can express snapshot intent through SandboxSnapshot and SandboxTemplate, while the snapshot controller manages snapshot tasks, artifacts, status, retries, and restore availability.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This PR follows the SnapStart proposal from #366 and implements the Phase 1 CodeInterpreter SnapStart path.

Key review areas:

Whether the SandboxSnapshot / SandboxSnapshotTask API shape matches the proposal.
Whether artifact manifest state transitions are correct for active and pending snapshot sets.
Whether retry, rebuild, promotion, and cleanup behavior in SandboxSnapshotController is sufficient.
Whether agentd snapshot task handling and driver registration are clean enough for future providers.
Whether CodeInterpreter restore metadata is attached at the right point during sandbox creation.

Verification:

go test -run '^$' ./... passed.
Full go test ./... was not runnable in the current sandbox because tests require local socket access, Redis, and kubeconfig.

Does this PR introduce a user-facing change?:

NONE

volcano-sh-bot · 2026-06-05T09:12:12Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist

Code Review

This pull request introduces comprehensive sandbox snapshotting capabilities, including new CRDs (SandboxSnapshot, SandboxSnapshotTask, SnapshotClass), a snapshot controller, a node-agent reconciler, and a Redis/Valkey-backed artifact store. The reviewer identified several critical issues: an infinite rebuilding loop in the snapshot controller due to un-reset readiness timestamps, potential resource flooding when using the fallback no-op artifact store, protocol desynchronization in the Kuasar driver from recreating buffered readers, missing connection deadlines on the admin socket, and write amplification from saving manifests inside a loop. Addressing these bugs and optimization opportunities is highly recommended before merging.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-05T09:14:33Z

+			manifest.ActiveSetRef = manifest.PendingSetRef
+			manifest.PendingSetRef = store.SnapshotArtifactSetRef{}
+			rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion)


When a background rebuild completes and the pending set is promoted to active, ss.Status.ReadyAt is not reset or updated. In subsequent reconciliations, time.Since(ss.Status.ReadyAt.Time) will still be greater than RebuildAfter.Duration, causing the controller to immediately trigger another background rebuild. This results in an infinite loop of continuous rebuilding and promotion, wasting massive CPU and I/O resources.

To fix this, reset ss.Status.ReadyAt to nil upon promotion so that aggregateAndUpdateStatus updates it to the current time.

manifest.ActiveSetRef = manifest.PendingSetRef manifest.PendingSetRef = store.SnapshotArtifactSetRef{} ss.Status.ReadyAt = nil rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion)

gemini-code-assist · 2026-06-05T09:14:33Z

+func (n *noopArtifactStore) GetManifest(_ context.Context, _ string) (*store.SnapshotArtifactManifest, error) {
+	return nil, nil
+}
+func (n *noopArtifactStore) PutManifest(_ context.Context, _ string, _ *store.SnapshotArtifactManifest, _ string) error {
+	return nil
+}


When no Redis/Valkey backend is configured, NewArtifactStoreFromEnv returns a noopArtifactStore. Since GetManifest silently returns nil, nil and PutManifest returns nil, the snapshot controller will think no active or pending set exists on every reconciliation cycle. It will continuously generate new snapshot keys and create new build Sandboxes and tasks, flooding the Kubernetes cluster with infinite orphaned resources.

To prevent this, make noopArtifactStore return an error from GetManifest and PutManifest so the controller safely fails and backs off, while the session creation path still gracefully falls back to cold start. Note that you will need to import "fmt" in this file.

Suggested change

func (n *noopArtifactStore) GetManifest(_ context.Context, _ string) (*store.SnapshotArtifactManifest, error) {

return nil, nil

}

func (n *noopArtifactStore) PutManifest(_ context.Context, _ string, _ *store.SnapshotArtifactManifest, _ string) error {

return nil

}

func (n *noopArtifactStore) GetManifest(_ context.Context, _ string) (*store.SnapshotArtifactManifest, error) {

return nil, fmt.Errorf("artifact store is not configured")

}

func (n *noopArtifactStore) PutManifest(_ context.Context, _ string, _ *store.SnapshotArtifactManifest, _ string) error {

return fmt.Errorf("artifact store is not configured")

}

gemini-code-assist · 2026-06-05T09:14:34Z

+func (d *KuasarDriver) readResponse(conn net.Conn) (*kuasarResponse, error) {
+	// TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised.
+	line, err := bufio.NewReader(conn).ReadString('\n')
+	if err != nil {
+		return nil, fmt.Errorf("read response: %w", err)
+	}
+	resp := &kuasarResponse{}
+	if err := json.Unmarshal([]byte(line), resp); err != nil {
+		return nil, fmt.Errorf("unmarshal response: %w", err)
+	}
+	if resp.Error != "" {
+		return nil, fmt.Errorf("kuasar error: %s", resp.Error)
+	}
+	return resp, nil
+}


Creating a new bufio.Reader on every call to readResponse is a classic Go networking bug. Since bufio.Reader buffers data from the underlying connection, any bytes read into the buffer beyond the first delimiter (\n) will be lost when the next call creates a new bufio.Reader. This will cause protocol desynchronization and connection hangs.

To fix this, instantiate a single bufio.Reader once per connection (e.g., in performHandshake) and pass it to readResponse and other helpers.

Suggested change

func (d *KuasarDriver) readResponse(conn net.Conn) (*kuasarResponse, error) {

// TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised.

line, err := bufio.NewReader(conn).ReadString('\n')

if err != nil {

return nil, fmt.Errorf("read response: %w", err)

}

resp := &kuasarResponse{}

if err := json.Unmarshal([]byte(line), resp); err != nil {

return nil, fmt.Errorf("unmarshal response: %w", err)

}

if resp.Error != "" {

return nil, fmt.Errorf("kuasar error: %s", resp.Error)

}

return resp, nil

}

func (d *KuasarDriver) readResponse(rd *bufio.Reader) (*kuasarResponse, error) {

// TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised.

line, err := rd.ReadString('\n')

if err != nil {

return nil, fmt.Errorf("read response: %w", err)

}

resp := &kuasarResponse{}

if err := json.Unmarshal([]byte(line), resp); err != nil {

return nil, fmt.Errorf("unmarshal response: %w", err)

}

if resp.Error != "" {

return nil, fmt.Errorf("kuasar error: %s", resp.Error)

}

return resp, nil

}

gemini-code-assist · 2026-06-05T09:14:34Z

+func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {
+	dialer := &net.Dialer{}
+	conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)
+	if err != nil {
+		return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)
+	}
+	return conn, nil
+}


The driver connects to the Kuasar admin socket but never sets any read or write deadlines. If the Kuasar daemon hangs or stops responding, the node agent's reconciler thread will block indefinitely on conn.Read or conn.Write, leading to thread starvation and resource leaks.

To prevent this, set a reasonable deadline on the connection right after dialing.

Suggested change

func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {

dialer := &net.Dialer{}

conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)

if err != nil {

return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)

}

return conn, nil

}

func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {

dialer := &net.Dialer{}

conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)

if err != nil {

return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)

}

_ = conn.SetDeadline(time.Now().Add(10 * time.Minute)) // Prevent indefinite hangs

return conn, nil

}

gemini-code-assist · 2026-06-05T09:14:34Z

+		for _, nodeName := range targetNodes {
+			if _, covered := coveredNodes[nodeName]; covered {
+				continue
+			}
+			// Create build Sandbox and task for this node.
+			if err := r.ensureBuildSandboxAndTask(ctx, ss, sc, artifactSet.SnapshotKey, artifactSet.SnapshotHash, nodeName); err != nil {
+				logger.Error(err, "failed to ensure build sandbox and task", "node", nodeName)
+				continue
+			}
+			// Register an artifact entry in Creating state.
+			artifactSet.Artifacts = append(artifactSet.Artifacts, store.SnapshotArtifact{
+				ProviderName: sc.Spec.ProviderName,
+				NodeName:     nodeName,
+				Phase:        store.SnapshotArtifactPhaseCreating,
+				SnapshotKey:  artifactSet.SnapshotKey,
+				SnapshotHash: artifactSet.SnapshotHash,
+			})
+			manifest.ArtifactSets[workingKey] = artifactSet
+			rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion)
+			if err != nil {
+				return rawVersion, err
+			}
+		}


In reconcileTasksAndArtifacts for Fork mode, the controller saves the manifest to the artifact store (Redis) inside the loop for every single target node. If there are multiple nodes, this causes write amplification and unnecessary transaction conflicts.

To optimize this, batch the new artifact creations and save the manifest once after the loop.

for _, nodeName := range targetNodes { if _, covered := coveredNodes[nodeName]; covered { continue } // Create build Sandbox and task for this node. if err := r.ensureBuildSandboxAndTask(ctx, ss, sc, artifactSet.SnapshotKey, artifactSet.SnapshotHash, nodeName); err != nil { logger.Error(err, "failed to ensure build sandbox and task", "node", nodeName) continue } // Register an artifact entry in Creating state. artifactSet.Artifacts = append(artifactSet.Artifacts, store.SnapshotArtifact{ ProviderName: sc.Spec.ProviderName, NodeName: nodeName, Phase: store.SnapshotArtifactPhaseCreating, SnapshotKey: artifactSet.SnapshotKey, SnapshotHash: artifactSet.SnapshotHash, }) } manifest.ArtifactSets[workingKey] = artifactSet rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion) if err != nil { return rawVersion, err }

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Introduces a snapshotting subsystem (CRDs + controllers + storage) and wires it into Workload Manager session creation to enable snapshot-based “warm” restores when a suitable snapshot is available.

Changes:

Adds runtime CRDs/types for SnapshotClass, SandboxSnapshot, and SandboxSnapshotTask, plus controller implementations in workload-manager and agentd.
Implements an artifact manifest store (Redis/Valkey-backed) and lookup logic to inject snapshot restore intent into created Sandboxes.
Updates CodeInterpreter reconciliation to always maintain a SandboxTemplate as a Fork snapshot source.

Reviewed changes

Copilot reviewed 18 out of 37 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
pkg/workloadmanager/snapshot_restore.go	Adds lookup logic to find an active snapshot key for restore intent.
pkg/workloadmanager/snapshot_controller.go	Adds controller for managing snapshots, tasks, artifact manifests, and status aggregation.
pkg/workloadmanager/server.go	Extends server to accept optional snapshot client + artifact store.
pkg/workloadmanager/handlers.go	Injects snapshot restore annotation on sandbox creation when applicable.
pkg/workloadmanager/codeinterpreter_controller.go	Always reconciles SandboxTemplate; stops deleting it when WarmPool disabled.
pkg/workloadmanager/artifact_store_init.go	Adds env-based ArtifactStore initialization (Redis/Valkey + noop fallback).
pkg/store/artifact_store.go	Introduces ArtifactStore interface and manifest data model.
pkg/store/artifact_store_redis.go	Adds Redis-backed ArtifactStore with optimistic CAS via WATCH/MULTI.
pkg/apis/runtime/v1alpha1/snapshot_types.go	Adds API types/constants and scheme registration for snapshot CRDs.
pkg/agentd/snapshot_task_reconciler.go	Adds node-local controller to execute snapshot tasks via drivers.
pkg/agentd/snapshot_driver.go	Defines SnapshotDriver interface and request/response model.
pkg/agentd/kuasar_driver.go	Adds Kuasar driver skeleton for snapshot creation via admin socket.
manifests/charts/base/crds/*.yaml	Adds CRDs for snapshot resources.
cmd/workload-manager/main.go	Wires snapshot reconciler + artifact store into manager and server.
cmd/agentd/main.go	Registers snapshot API scheme and starts SnapshotTask controller.
go.mod	Minor dependency ordering adjustment.

Files not reviewed (19)

client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshot.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshottask.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_snapshotclass.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/snapshotclass.go: Language not supported
client-go/informers/externalversions/generic.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/interface.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/snapshotclass.go: Language not supported
client-go/listers/runtime/v1alpha1/expansion_generated.go: Language not supported
client-go/listers/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
client-go/listers/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
client-go/listers/runtime/v1alpha1/snapshotclass.go: Language not supported
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go: Language not supported

Comments suppressed due to low confidence (4)

pkg/workloadmanager/snapshot_controller.go:1

In ensureBuildSandboxAndTask, returning early when the task already exists can leave the system stuck if the build Sandbox was deleted (or never created) while the task remains. The agent reconciler will then requeue forever waiting for the target Sandbox to become Running, and the controller will never recreate it because it exits early. Suggested fix: even when the task exists, still ensure the build Sandbox exists (or recreate it if missing), or only early-return when the task is already in a terminal phase.
pkg/workloadmanager/snapshot_restore.go:1
The lookup returns the first Ready Fork snapshot encountered, but List order is not guaranteed. If multiple snapshots exist for the same SandboxTemplate, this can pick an older/incorrect snapshot arbitrarily. Suggested fix: select deterministically (e.g., prefer status.readyAt newest, else metadata.creationTimestamp newest) before returning the snapshot key.
pkg/workloadmanager/snapshot_controller.go:1
Fork promotion requires all node artifacts to be Ready before ActiveSetRef is set. However, status/events and restore lookup semantics elsewhere imply "at least one artifact available" should be sufficient for using snapshot restore intent. As-is, snapshot restore won’t be used until full fleet coverage completes. Suggested fix: either (a) promote to active once any artifact is Ready (and continue building remaining nodes in the background), or (b) allow restore lookup to use the pending set when it has at least one Ready artifact, and update status/event messaging accordingly so semantics are consistent.
pkg/workloadmanager/snapshot_controller.go:1
Hash normalization sorts tolerations only by Key using sort.Slice (unstable). If multiple tolerations share the same key, their relative order can change, leading to hash churn and unnecessary rebuilds. Suggested fix: use sort.SliceStable and/or add tie-breakers over the full toleration tuple (e.g., Key, Operator, Value, Effect, TolerationSeconds) to guarantee deterministic ordering.

+func (d *KuasarDriver) readResponse(conn net.Conn) (*kuasarResponse, error) {
+	// TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised.
+	line, err := bufio.NewReader(conn).ReadString('\n')


+	snapshotReconciler := &workloadmanager.SandboxSnapshotReconciler{
+		Client:        mgr.GetClient(),
+		ArtifactStore: workloadmanager.NewArtifactStoreFromEnv(),
+	}


codecov-commenter · 2026-06-05T09:16:50Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 1.86005% with 1108 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.32%. Comparing base (524e55e) to head (bcd8ad5).
⚠️ Report is 119 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/workloadmanager/snapshot_controller.go	0.00%	287 Missing ⚠️
pkg/workloadmanager/snapshot_fork.go	0.00%	265 Missing ⚠️
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go	0.00%	187 Missing ⚠️
pkg/agentd/kuasar_driver.go	0.78%	125 Missing and 1 partial ⚠️
pkg/agentd/snapshot_task_reconciler.go	0.00%	86 Missing ⚠️
pkg/store/artifact_store_redis.go	0.00%	40 Missing ⚠️
pkg/workloadmanager/artifact_store_init.go	0.00%	30 Missing ⚠️
pkg/agentd/node_capability.go	0.00%	27 Missing ⚠️
pkg/agentd/driver_registry.go	8.33%	22 Missing ⚠️
pkg/agentd/snapshot_mode.go	0.00%	17 Missing ⚠️
... and 4 more
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #379      +/-   ##
==========================================
- Coverage   47.57%   43.32%   -4.25%     
==========================================
  Files          30       45      +15     
  Lines        2819     4279    +1460     
==========================================
+ Hits         1341     1854     +513     
- Misses       1338     2239     +901     
- Partials      140      186      +46

Flag	Coverage Δ
unittests	`43.32% <1.86%> (-4.25%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: lyuyun <lyuyun068@gmail.com>

Copilot

Pull request overview

Copilot reviewed 24 out of 42 changed files in this pull request and generated 1 comment.

Files not reviewed (18)

client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshot.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshottask.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_snapshotclass.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/snapshotclass.go: Language not supported
client-go/informers/externalversions/generic.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/interface.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/snapshotclass.go: Language not supported
client-go/listers/runtime/v1alpha1/expansion_generated.go: Language not supported
client-go/listers/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
client-go/listers/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
client-go/listers/runtime/v1alpha1/snapshotclass.go: Language not supported

Comments suppressed due to low confidence (8)

pkg/workloadmanager/snapshot_fork.go:1

The sort.Slice comparator is not a strict ordering (it only compares Key). If two tolerations share the same Key, the comparator returns false for both directions, and Go’s (unstable) sort may reorder elements nondeterministically, causing snapshot hashes to flap. Use sort.SliceStable and/or add tie-breakers (e.g., compare Operator, Value, Effect, TolerationSeconds) so the order is deterministic.
pkg/workloadmanager/snapshot_controller.go:1
buildSnapshotKey can easily exceed 63 characters, but snapshotKey is used as a label value (e.g., SnapshotKeyLabelKey) in the Fork handler and controller list filters. Kubernetes label values must be ≤63 chars, otherwise resource creation/list matching will fail. Consider generating a separate label-safe identifier (e.g., sha256(snapshotKey) truncated) for labels and keep the full snapshotKey in spec/annotations, or enforce a bounded snapshotKey format that preserves uniqueness within 63 chars.
pkg/workloadmanager/snapshot_controller.go:1
Unavailable is a defined phase for snapshot artifacts/tasks, but cleanup currently only triggers for Ready and Failed. If a task reaches Unavailable, it will never be deleted and mode-specific cleanup won’t run, which can leak build sandboxes and tasks. Include SnapshotArtifactPhaseUnavailable as a terminal phase here (and in any other terminal-phase checks).
pkg/agentd/snapshot_task_reconciler.go:1
SnapshotArtifactPhaseUnavailable is a defined terminal-ish state but isn’t treated as terminal here. If the node agent (now or later) sets Unavailable, reconcile will continue and may keep attempting driver operations unnecessarily. Treat Unavailable as terminal (or add explicit handling) to match the controller’s phase model.
pkg/workloadmanager/snapshot_fork.go:1
lookupActiveForkSnapshotKey lists all SandboxSnapshot objects in the namespace and scans them on every sandbox creation request. Even with a cached controller-runtime client, this is O(N) per request and can become a hotspot. A concrete improvement is to label SandboxSnapshot with sourceRef.name / mode and list using client.MatchingLabels, or add a cache index on .spec.sourceRef.name and use MatchingFields to avoid scanning unrelated snapshots.
pkg/workloadmanager/snapshot_controller.go:1
This map access doesn’t check existence. If PendingSetRef.SnapshotKey is set but the map entry is missing (store corruption, manual edits, or partial writes), pending becomes a zero-value set which can mask the problem and potentially be promoted depending on handler logic. Use pending, ok := ...; if !ok { ... } and clear PendingSetRef (and persist) or surface a clear error so the controller doesn’t proceed with invalid data.
pkg/store/artifact_store_redis.go:1
The comment about the “version token” doesn’t match the implementation: callers pass the previous raw manifest JSON string (from GetManifest/loadManifest) and PutManifest compares it directly to the current stored string; it’s not a “JSON encoding of the raw string value”. Please update the comment to describe the actual token semantics (raw stored value snapshot) to avoid confusing future maintainers.
pkg/workloadmanager/snapshot_controller.go:1
json.Marshal errors are ignored here. While it “shouldn’t happen” for this struct, returning the marshal error is safer and avoids returning an empty/incorrect version token which could break compare-and-set behavior on subsequent writes.

+	if err := agentd.AdvertiseDriverCapabilities(ctrl.SetupSignalHandler(), cs, nodeName, registry.Drivers()); err != nil {
+		fmt.Fprintf(os.Stderr, "unable to advertise driver capabilities: %v\n", err)
+		os.Exit(1)
+	}


Copilot AI review requested due to automatic review settings June 5, 2026 09:12

volcano-sh-bot added the kind/feature label Jun 5, 2026

volcano-sh-bot requested review from LiZhenCheng9527, VanderChen, YaoZengzeng, acsoto and hzxuzhonghu June 5, 2026 09:12

volcano-sh-bot added the size/XXL label Jun 5, 2026

gemini-code-assist Bot reviewed Jun 5, 2026

View reviewed changes

Copilot AI reviewed Jun 5, 2026

View reviewed changes

lyuyun force-pushed the snapstart-feat branch from 27ac27e to 3e265ac Compare June 5, 2026 09:44

feat: implement snapstart for codeinterpreter

bcd8ad5

Signed-off-by: lyuyun <lyuyun068@gmail.com>

Copilot AI review requested due to automatic review settings June 5, 2026 11:14

lyuyun force-pushed the snapstart-feat branch from 3e265ac to bcd8ad5 Compare June 5, 2026 11:14

Copilot AI reviewed Jun 5, 2026

View reviewed changes

Comment thread cmd/agentd/main.go

Comment on lines +89 to +92

if err := agentd.AdvertiseDriverCapabilities(ctrl.SetupSignalHandler(), cs, nodeName, registry.Drivers()); err != nil {

fmt.Fprintf(os.Stderr, "unable to advertise driver capabilities: %v\n", err)

os.Exit(1)

}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: implement snapstart for codeinterpreter#379

feat: implement snapstart for codeinterpreter#379
lyuyun wants to merge 1 commit into
volcano-sh:mainfrom
lyuyun:snapstart-feat

lyuyun commented Jun 5, 2026

Uh oh!

volcano-sh-bot commented Jun 5, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

codecov-commenter commented Jun 5, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

lyuyun commented Jun 5, 2026

Uh oh!

volcano-sh-bot commented Jun 5, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov-commenter commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Jun 5, 2026 •

edited

Loading