Skip to content

feat: implement snapstart for codeinterpreter#380

Closed
lyuyun wants to merge 1 commit into
volcano-sh:mainfrom
lyuyun:snapstart-feat-dev
Closed

feat: implement snapstart for codeinterpreter#380
lyuyun wants to merge 1 commit into
volcano-sh:mainfrom
lyuyun:snapstart-feat-dev

Conversation

@lyuyun

@lyuyun lyuyun commented Jun 5, 2026

Copy link
Copy Markdown

No description provided.

Signed-off-by: lyuyun <lyuyun068@gmail.com>
Copilot AI review requested due to automatic review settings June 5, 2026 10:20
@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces SandboxSnapshot support (Fork mode) across the control-plane and node agent, including new CRDs, controllers, artifact-store persistence, and restore-intent injection for session sandboxes.

Changes:

  • Add SnapshotClass/SandboxSnapshot/SandboxSnapshotTask APIs + CRDs, plus controllers in workload-manager and agentd.
  • Implement Fork snapshot mode end-to-end, including per-node build sandboxes, task orchestration, hashing, and artifact manifest management.
  • Add Redis/Valkey-backed artifact store and wire snapshot restore-intent into sandbox creation.

Reviewed changes

Copilot reviewed 22 out of 40 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/workloadmanager/snapshot_mode.go Defines the mode-handler interface for snapshot reconciliation.
pkg/workloadmanager/snapshot_fork.go Implements Fork-mode snapshot orchestration, hashing, and snapshot-key lookup for restore intent.
pkg/workloadmanager/snapshot_controller.go Adds the SandboxSnapshot controller and shared manifest/status utilities.
pkg/workloadmanager/server.go Wires optional snapshot lookup dependencies into the API server.
pkg/workloadmanager/handlers.go Injects snapshot restore intent into sandbox creation (CodeInterpreter path).
pkg/workloadmanager/codeinterpreter_controller.go Always reconciles SandboxTemplate so it can serve as Fork snapshot source.
pkg/workloadmanager/artifact_store_init.go Adds env-based artifact store init (Redis/Valkey) with noop fallback.
pkg/store/artifact_store.go Introduces ArtifactStore interface and snapshot artifact manifest schema.
pkg/store/artifact_store_redis.go Implements ArtifactStore with Redis CAS semantics.
pkg/apis/runtime/v1alpha1/snapshot_types.go Adds SnapshotClass/SandboxSnapshot/SandboxSnapshotTask API types and constants.
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go Adds generated DeepCopy methods for new API types.
pkg/agentd/snapshot_task_reconciler.go Adds node-agent controller for executing SandboxSnapshotTasks.
pkg/agentd/snapshot_mode.go Adds per-mode target validation (Fork waits for build sandbox readiness).
pkg/agentd/snapshot_driver.go Adds snapshot driver interface and request/response structs.
pkg/agentd/kuasar_driver.go Adds a Kuasar SnapStart driver scaffold (handshake + create/delete/inspect stubs).
manifests/charts/base/crds/runtime.agentcube.volcano.sh_snapshotclasses.yaml Adds SnapshotClass CRD.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_sandboxsnapshots.yaml Adds SandboxSnapshot CRD.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_sandboxsnapshottasks.yaml Adds SandboxSnapshotTask CRD.
hack/update-codegen.sh Extends lister-gen workaround to new snapshot resources.
cmd/workload-manager/main.go Registers snapshot controller, initializes artifact store, and passes deps to server.
cmd/agentd/main.go Registers snapshot task controller and adds runtime API scheme.
go.mod Minor ordering adjustment in require block.
Files not reviewed (18)
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshot.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshottask.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_snapshotclass.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/snapshotclass.go: Language not supported
  • client-go/informers/externalversions/generic.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/interface.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/snapshotclass.go: Language not supported
  • client-go/listers/runtime/v1alpha1/expansion_generated.go: Language not supported
  • client-go/listers/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
  • client-go/listers/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
  • client-go/listers/runtime/v1alpha1/snapshotclass.go: Language not supported
Comments suppressed due to low confidence (5)

pkg/workloadmanager/snapshot_controller.go:1

  • Completed-task cleanup ignores SnapshotArtifactPhaseUnavailable. Tasks that end as Unavailable won't be cleaned up (and mode-specific cleanup like deleting build sandboxes won't run), causing resource leaks. Include Unavailable among the terminal phases for cleanup.
    pkg/workloadmanager/snapshot_controller.go:1
  • This reads manifest.ArtifactSets[pendingKey] without checking presence. If the manifest is corrupted or the set was deleted (e.g., manual edits / store issues), pending becomes a zero-value set and could be incorrectly promoted or cause confusing logs/status. Guard with an ok check and clear PendingSetRef (or fail status) when the key is missing.
    pkg/workloadmanager/snapshot_fork.go:1
  • Hash normalization sorts tolerations only by Key. If multiple tolerations share a key, the comparator doesn't define a strict total order, and sort.Slice is not stable—this can yield non-deterministic ordering and spurious hash changes/rebuilds. Sort by a full tuple (e.g., Key, Operator, Value, Effect, TolerationSeconds) to make the ordering deterministic.
    pkg/workloadmanager/snapshot_fork.go:1
  • lookupActiveForkSnapshotKey lists all SandboxSnapshots in the namespace and then potentially calls the artifact store per candidate. This can become a hot-path bottleneck during session creation (list cost + N artifact-store reads). Consider adding a label (e.g., sourceRef name + mode) on SandboxSnapshot objects and using client.MatchingLabels to narrow the list, or maintaining an index/cache for ready snapshots keyed by (namespace, sourceTemplateName).
    pkg/store/artifact_store_redis.go:1
  • The comment about the version token being a 'JSON encoding of the raw string value' doesn't match the implementation: PutManifest directly compares current (raw Redis string) with version (also a raw string). Either update the comment to reflect raw-string comparison, or adjust the code to actually use an encoded token format if that's the intended contract.

Comment on lines +60 to +64
description: SandboxSnapshotMode selects the snapshot usage mode.
enum:
- Fork
- Resume
type: string
Comment on lines +61 to +65
// Skip tasks that have already reached a terminal phase.
if task.Status.Phase == runtimev1alpha1.SnapshotArtifactPhaseReady ||
task.Status.Phase == runtimev1alpha1.SnapshotArtifactPhaseFailed {
return ctrl.Result{}, nil
}
Comment on lines +110 to +113
snapshotReconciler := &workloadmanager.SandboxSnapshotReconciler{
Client: mgr.GetClient(),
ArtifactStore: workloadmanager.NewArtifactStoreFromEnv(),
}

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a sandbox snapshotting feature to Volcano AgentCube, adding CRDs for SandboxSnapshot, SandboxSnapshotTask, and SnapshotClass, along with a node-local Kuasar snapshot driver, a snapshot task reconciler, and a control-plane snapshot controller. It also integrates snapshot restore intent into the workload manager and implements a Redis/Valkey-backed artifact store. The code review highlights several critical issues and improvement opportunities: a potential resource leak of physical artifacts on nodes during promotion, non-deterministic behavior in snapshot lookup and pod spec normalization, and inefficient O(N) slice length calculation. Additionally, feedback suggests improving robustness by aligning connection deadlines with context deadlines in the Kuasar driver and returning the raw manifest string from the artifact store to avoid double-marshaling overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +30 to +32
// GetManifest retrieves the manifest for the given snapshot owner key.
// Returns nil, nil when no manifest exists.
GetManifest(ctx context.Context, ownerKey string) (*SnapshotArtifactManifest, error)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

GetManifest does not return the raw version string of the manifest from Redis. This forces loadManifest to use json.Marshal to generate a version string, which is non-deterministic (due to map key ordering/omitted fields) and introduces double-marshaling overhead. Changing GetManifest to return (*SnapshotArtifactManifest, string, error) where the second return value is the raw string retrieved from Redis avoids these issues.

Suggested change
// GetManifest retrieves the manifest for the given snapshot owner key.
// Returns nil, nil when no manifest exists.
GetManifest(ctx context.Context, ownerKey string) (*SnapshotArtifactManifest, error)
// GetManifest retrieves the manifest and its raw string version for the given snapshot owner key.
// Returns nil, "", nil when no manifest exists.
GetManifest(ctx context.Context, ownerKey string) (*SnapshotArtifactManifest, string, error)

Comment on lines +122 to +124
if manifest.ActiveSetRef.SnapshotKey != "" {
delete(manifest.ArtifactSets, manifest.ActiveSetRef.SnapshotKey)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When promoting a pending set to active, the old active set is deleted from manifest.ArtifactSets, but the physical snapshot artifacts on the nodes are never deleted, leading to a permanent resource leak on the nodes. Consider introducing a deletion/cleanup task or garbage collection mechanism to remove physical artifacts that are no longer referenced.

Comment on lines +409 to +457
func lookupActiveForkSnapshotKey(
ctx context.Context,
k8sClient client.Client,
artifactStore store.ArtifactStore,
namespace, sandboxTemplateName string,
) string {
snapshotList := &runtimev1alpha1.SandboxSnapshotList{}
if err := k8sClient.List(ctx, snapshotList, client.InNamespace(namespace)); err != nil {
klog.V(4).InfoS("snapshot lookup: failed to list snapshots, falling back to cold start",
"namespace", namespace, "error", err)
return ""
}

for i := range snapshotList.Items {
ss := &snapshotList.Items[i]
if ss.Spec.SnapshotMode != runtimev1alpha1.SandboxSnapshotModeFork {
continue
}
if ss.Spec.SourceRef.Name != sandboxTemplateName {
continue
}
if ss.Status.Phase != runtimev1alpha1.SandboxSnapshotPhaseReady {
continue
}

ownerKey := store.ArtifactOwnerKey("SandboxSnapshot", ss.Namespace, ss.Name, string(ss.UID))
manifest, err := artifactStore.GetManifest(ctx, ownerKey)
if err != nil {
klog.V(4).InfoS("snapshot lookup: artifact store error, falling back to cold start",
"snapshot", ss.Name, "error", err)
continue
}
if manifest == nil || manifest.ActiveSetRef.SnapshotKey == "" {
continue
}
activeSet, ok := manifest.ArtifactSets[manifest.ActiveSetRef.SnapshotKey]
if !ok {
continue
}
for _, art := range activeSet.Artifacts {
if art.Phase == store.SnapshotArtifactPhaseReady {
klog.V(4).InfoS("snapshot lookup: found active fork snapshot key",
"snapshot", ss.Name, "snapshotKey", manifest.ActiveSetRef.SnapshotKey)
return manifest.ActiveSetRef.SnapshotKey
}
}
}
return ""
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

lookupActiveForkSnapshotKey returns the first active snapshot key it finds in an arbitrary order. If multiple snapshots exist for the same template, this leads to non-deterministic behavior. Sorting or selecting the snapshot with the latest CreationTimestamp or ReadyAt time ensures deterministic behavior.

func lookupActiveForkSnapshotKey(
	ctx context.Context,
	k8sClient client.Client,
	artifactStore store.ArtifactStore,
	namespace, sandboxTemplateName string,
) string {
	snapshotList := &runtimev1alpha1.SandboxSnapshotList{}
	if err := k8sClient.List(ctx, snapshotList, client.InNamespace(namespace)); err != nil {
		klog.V(4).InfoS("snapshot lookup: failed to list snapshots, falling back to cold start",
			"namespace", namespace, "error", err)
		return ""
	}

	var bestSnapshot *runtimev1alpha1.SandboxSnapshot
	var bestSnapshotKey string

	for i := range snapshotList.Items {
		ss := &snapshotList.Items[i]
		if ss.Spec.SnapshotMode != runtimev1alpha1.SandboxSnapshotModeFork {
			continue
		}
		if ss.Spec.SourceRef.Name != sandboxTemplateName {
			continue
		}
		if ss.Status.Phase != runtimev1alpha1.SandboxSnapshotPhaseReady {
			continue
		}

		ownerKey := store.ArtifactOwnerKey("SandboxSnapshot", ss.Namespace, ss.Name, string(ss.UID))
		manifest, err := artifactStore.GetManifest(ctx, ownerKey)
		if err != nil {
			klog.V(4).InfoS("snapshot lookup: artifact store error, falling back to cold start",
				"snapshot", ss.Name, "error", err)
			continue
		}
		if manifest == nil || manifest.ActiveSetRef.SnapshotKey == "" {
			continue
		}
		activeSet, ok := manifest.ArtifactSets[manifest.ActiveSetRef.SnapshotKey]
		if !ok {
			continue
		}
		hasReadyArtifact := false
		for _, art := range activeSet.Artifacts {
			if art.Phase == store.SnapshotArtifactPhaseReady {
				hasReadyArtifact = true
				break
			}
		}
		if !hasReadyArtifact {
			continue
		}

		if bestSnapshot == nil || ss.CreationTimestamp.After(bestSnapshot.CreationTimestamp.Time) {
			bestSnapshot = ss
			bestSnapshotKey = manifest.ActiveSetRef.SnapshotKey
		}
	}

	if bestSnapshotKey != "" {
		klog.V(4).InfoS("snapshot lookup: found active fork snapshot key",
			"snapshot", bestSnapshot.Name, "snapshotKey", bestSnapshotKey)
		return bestSnapshotKey
	}
	return ""
}

Comment on lines +201 to +203
func (d *KuasarDriver) waitForReadiness(ctx context.Context, conn net.Conn, rd *bufio.Reader, sandboxID string) error {
ticker := time.NewTicker(kuasarReadinessPollInterval)
defer ticker.Stop()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In waitForReadiness, the connection's deadline is not aligned with the ctx deadline. If ctx times out, the loop might hang for up to an extra minute because readResponse is blocked on a socket read with a longer deadline. Setting the connection's deadline to ctx.Deadline() at the start of waitForReadiness ensures that any blocked read/write inside the loop will unblock immediately when ctx times out.

func (d *KuasarDriver) waitForReadiness(ctx context.Context, conn net.Conn, rd *bufio.Reader, sandboxID string) error {
	if deadline, ok := ctx.Deadline(); ok {
		_ = conn.SetDeadline(deadline)
	}
	ticker := time.NewTicker(kuasarReadinessPollInterval)
	defer ticker.Stop()

Comment on lines +140 to +147
func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {
dialer := &net.Dialer{}
conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)
if err != nil {
return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)
}
_ = conn.SetDeadline(time.Now().Add(kuasarReadinessTimeout + time.Minute))
return conn, nil

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In dialSocket, the connection's deadline is set to a fixed 6 minutes in the future, ignoring any deadline/timeout on the passed ctx. It is much more robust and idiomatic to respect ctx.Deadline() if it is set.

Suggested change
func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {
dialer := &net.Dialer{}
conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)
if err != nil {
return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)
}
_ = conn.SetDeadline(time.Now().Add(kuasarReadinessTimeout + time.Minute))
return conn, nil
func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {
dialer := &net.Dialer{}
conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)
if err != nil {
return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)
}
if deadline, ok := ctx.Deadline(); ok {
_ = conn.SetDeadline(deadline)
} else {
_ = conn.SetDeadline(time.Now().Add(kuasarReadinessTimeout + time.Minute))
}
return conn, nil
}

Comment on lines +457 to +466
func int32Len(artifacts []store.SnapshotArtifact) int32 {
count := int32(0)
for range artifacts {
if count == maxInt32Value {
return maxInt32Value
}
count++
}
return count
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

int32Len is implemented as an O(N) loop over the slice, which is highly inefficient and unnecessary. Using int32(len(artifacts)) directly is O(1) and much cleaner.

func int32Len(artifacts []store.SnapshotArtifact) int32 {
	return int32(len(artifacts))
}

Comment on lines +388 to +395
func normalizePodSpec(spec corev1.PodSpec) corev1.PodSpec {
s := spec.DeepCopy()
s.NodeName = ""
sort.Slice(s.Tolerations, func(i, j int) bool {
return s.Tolerations[i].Key < s.Tolerations[j].Key
})
return *s
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

normalizePodSpec sorts tolerations only by Key. If multiple tolerations have the same key (or empty keys), the sort order is non-deterministic, which can cause hash mismatches for semantically identical specs. Sorting by Key, Value, and Effect ensures deterministic sorting.

func normalizePodSpec(spec corev1.PodSpec) corev1.PodSpec {
	s := spec.DeepCopy()
	s.NodeName = ""
	sort.Slice(s.Tolerations, func(i, j int) bool {
		if s.Tolerations[i].Key != s.Tolerations[j].Key {
			return s.Tolerations[i].Key < s.Tolerations[j].Key
		}
		if s.Tolerations[i].Value != s.Tolerations[j].Value {
			return s.Tolerations[i].Value < s.Tolerations[j].Value
		}
		return string(s.Tolerations[i].Effect) < string(s.Tolerations[j].Effect)
	})
	return *s
}

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 1.67598% with 1056 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.93%. Comparing base (524e55e) to head (646c785).
⚠️ Report is 119 commits behind head on main.

Files with missing lines Patch % Lines
pkg/workloadmanager/snapshot_controller.go 0.00% 287 Missing ⚠️
pkg/workloadmanager/snapshot_fork.go 0.00% 265 Missing ⚠️
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go 0.00% 187 Missing ⚠️
pkg/agentd/kuasar_driver.go 0.00% 123 Missing ⚠️
pkg/agentd/snapshot_task_reconciler.go 0.00% 86 Missing ⚠️
pkg/store/artifact_store_redis.go 0.00% 40 Missing ⚠️
pkg/workloadmanager/artifact_store_init.go 0.00% 30 Missing ⚠️
pkg/agentd/snapshot_mode.go 0.00% 17 Missing ⚠️
pkg/workloadmanager/handlers.go 60.86% 9 Missing ⚠️
pkg/workloadmanager/codeinterpreter_controller.go 0.00% 6 Missing ⚠️
... and 2 more
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #380      +/-   ##
==========================================
- Coverage   47.57%   43.93%   -3.64%     
==========================================
  Files          30       43      +13     
  Lines        2819     4224    +1405     
==========================================
+ Hits         1341     1856     +515     
- Misses       1338     2184     +846     
- Partials      140      184      +44     
Flag Coverage Δ
unittests 43.93% <1.67%> (-3.64%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lyuyun lyuyun closed this Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants