Skip to content

feat: implement snapstart for codeinterpreter#379

Open
lyuyun wants to merge 1 commit into
volcano-sh:mainfrom
lyuyun:snapstart-feat
Open

feat: implement snapstart for codeinterpreter#379
lyuyun wants to merge 1 commit into
volcano-sh:mainfrom
lyuyun:snapstart-feat

Conversation

@lyuyun
Copy link
Copy Markdown

@lyuyun lyuyun commented Jun 5, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements the AgentCube SnapStart feature for CodeInterpreter.

SnapStart provides a snapshot-based startup acceleration path for CodeInterpreter sessions. It builds reusable runtime snapshot artifacts and lets new sandboxes restore from an active snapshotKey instead of always cold-starting from the image.

This PR implements the control-plane and node-agent pieces for SnapStart:

  • SandboxSnapshot: the generic snapshot CRD, supporting snapshotMode=Fork and snapshotMode=Resume.
  • SnapshotClass: infrastructure/provider capability selection using providerName, supportedSnapshotModes, and node selection.
  • SandboxSnapshotTask: internal node-facing task CRD used by the snapshot controller to dispatch snapshot creation to node agents.
  • SandboxSnapshotController: workload-manager controller that manages snapshot lifecycle, target node selection, artifact task creation, retry handling, active/pending artifact promotion, and cleanup.
  • SnapshotDriver: node-agent-local driver abstraction for provider-specific snapshot creation.
  • Kuasar snapshot driver wiring in agentd.
  • Redis-backed SnapshotArtifactManifest storage for active and pending snapshot artifact sets.
  • CodeInterpreter restore integration through snapshot metadata passed into session Sandbox creation.

The implementation keeps snapshot orchestration generic. Runtime-specific controllers can express snapshot intent through SandboxSnapshot and SandboxTemplate, while the snapshot controller manages snapshot tasks, artifacts, status, retries, and restore availability.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This PR follows the SnapStart proposal from #366 and implements the Phase 1 CodeInterpreter SnapStart path.

Key review areas:

  • Whether the SandboxSnapshot / SandboxSnapshotTask API shape matches the proposal.
  • Whether artifact manifest state transitions are correct for active and pending snapshot sets.
  • Whether retry, rebuild, promotion, and cleanup behavior in SandboxSnapshotController is sufficient.
  • Whether agentd snapshot task handling and driver registration are clean enough for future providers.
  • Whether CodeInterpreter restore metadata is attached at the right point during sandbox creation.

Verification:

  • go test -run '^$' ./... passed.
  • Full go test ./... was not runnable in the current sandbox because tests require local socket access, Redis, and kubeconfig.

Does this PR introduce a user-facing change?:

NONE

Copilot AI review requested due to automatic review settings June 5, 2026 09:12
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive sandbox snapshotting capabilities, including new CRDs (SandboxSnapshot, SandboxSnapshotTask, SnapshotClass), a snapshot controller, a node-agent reconciler, and a Redis/Valkey-backed artifact store. The reviewer identified several critical issues: an infinite rebuilding loop in the snapshot controller due to un-reset readiness timestamps, potential resource flooding when using the fallback no-op artifact store, protocol desynchronization in the Kuasar driver from recreating buffered readers, missing connection deadlines on the admin socket, and write amplification from saving manifests inside a loop. Addressing these bugs and optimization opportunities is highly recommended before merging.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +214 to +216
manifest.ActiveSetRef = manifest.PendingSetRef
manifest.PendingSetRef = store.SnapshotArtifactSetRef{}
rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

When a background rebuild completes and the pending set is promoted to active, ss.Status.ReadyAt is not reset or updated. In subsequent reconciliations, time.Since(ss.Status.ReadyAt.Time) will still be greater than RebuildAfter.Duration, causing the controller to immediately trigger another background rebuild. This results in an infinite loop of continuous rebuilding and promotion, wasting massive CPU and I/O resources.

To fix this, reset ss.Status.ReadyAt to nil upon promotion so that aggregateAndUpdateStatus updates it to the current time.

			manifest.ActiveSetRef = manifest.PendingSetRef
			manifest.PendingSetRef = store.SnapshotArtifactSetRef{}
			ss.Status.ReadyAt = nil
			rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion)

Comment on lines +74 to +79
func (n *noopArtifactStore) GetManifest(_ context.Context, _ string) (*store.SnapshotArtifactManifest, error) {
return nil, nil
}
func (n *noopArtifactStore) PutManifest(_ context.Context, _ string, _ *store.SnapshotArtifactManifest, _ string) error {
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

When no Redis/Valkey backend is configured, NewArtifactStoreFromEnv returns a noopArtifactStore. Since GetManifest silently returns nil, nil and PutManifest returns nil, the snapshot controller will think no active or pending set exists on every reconciliation cycle. It will continuously generate new snapshot keys and create new build Sandboxes and tasks, flooding the Kubernetes cluster with infinite orphaned resources.

To prevent this, make noopArtifactStore return an error from GetManifest and PutManifest so the controller safely fails and backs off, while the session creation path still gracefully falls back to cold start. Note that you will need to import "fmt" in this file.

Suggested change
func (n *noopArtifactStore) GetManifest(_ context.Context, _ string) (*store.SnapshotArtifactManifest, error) {
return nil, nil
}
func (n *noopArtifactStore) PutManifest(_ context.Context, _ string, _ *store.SnapshotArtifactManifest, _ string) error {
return nil
}
func (n *noopArtifactStore) GetManifest(_ context.Context, _ string) (*store.SnapshotArtifactManifest, error) {
return nil, fmt.Errorf("artifact store is not configured")
}
func (n *noopArtifactStore) PutManifest(_ context.Context, _ string, _ *store.SnapshotArtifactManifest, _ string) error {
return fmt.Errorf("artifact store is not configured")
}

Comment thread pkg/agentd/kuasar_driver.go Outdated
Comment on lines +250 to +264
func (d *KuasarDriver) readResponse(conn net.Conn) (*kuasarResponse, error) {
// TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised.
line, err := bufio.NewReader(conn).ReadString('\n')
if err != nil {
return nil, fmt.Errorf("read response: %w", err)
}
resp := &kuasarResponse{}
if err := json.Unmarshal([]byte(line), resp); err != nil {
return nil, fmt.Errorf("unmarshal response: %w", err)
}
if resp.Error != "" {
return nil, fmt.Errorf("kuasar error: %s", resp.Error)
}
return resp, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Creating a new bufio.Reader on every call to readResponse is a classic Go networking bug. Since bufio.Reader buffers data from the underlying connection, any bytes read into the buffer beyond the first delimiter (\n) will be lost when the next call creates a new bufio.Reader. This will cause protocol desynchronization and connection hangs.

To fix this, instantiate a single bufio.Reader once per connection (e.g., in performHandshake) and pass it to readResponse and other helpers.

Suggested change
func (d *KuasarDriver) readResponse(conn net.Conn) (*kuasarResponse, error) {
// TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised.
line, err := bufio.NewReader(conn).ReadString('\n')
if err != nil {
return nil, fmt.Errorf("read response: %w", err)
}
resp := &kuasarResponse{}
if err := json.Unmarshal([]byte(line), resp); err != nil {
return nil, fmt.Errorf("unmarshal response: %w", err)
}
if resp.Error != "" {
return nil, fmt.Errorf("kuasar error: %s", resp.Error)
}
return resp, nil
}
func (d *KuasarDriver) readResponse(rd *bufio.Reader) (*kuasarResponse, error) {
// TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised.
line, err := rd.ReadString('\n')
if err != nil {
return nil, fmt.Errorf("read response: %w", err)
}
resp := &kuasarResponse{}
if err := json.Unmarshal([]byte(line), resp); err != nil {
return nil, fmt.Errorf("unmarshal response: %w", err)
}
if resp.Error != "" {
return nil, fmt.Errorf("kuasar error: %s", resp.Error)
}
return resp, nil
}

Comment on lines +141 to +148
func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {
dialer := &net.Dialer{}
conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)
if err != nil {
return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)
}
return conn, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The driver connects to the Kuasar admin socket but never sets any read or write deadlines. If the Kuasar daemon hangs or stops responding, the node agent's reconciler thread will block indefinitely on conn.Read or conn.Write, leading to thread starvation and resource leaks.

To prevent this, set a reasonable deadline on the connection right after dialing.

Suggested change
func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {
dialer := &net.Dialer{}
conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)
if err != nil {
return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)
}
return conn, nil
}
func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {
dialer := &net.Dialer{}
conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)
if err != nil {
return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)
}
_ = conn.SetDeadline(time.Now().Add(10 * time.Minute)) // Prevent indefinite hangs
return conn, nil
}

Comment on lines +411 to +433
for _, nodeName := range targetNodes {
if _, covered := coveredNodes[nodeName]; covered {
continue
}
// Create build Sandbox and task for this node.
if err := r.ensureBuildSandboxAndTask(ctx, ss, sc, artifactSet.SnapshotKey, artifactSet.SnapshotHash, nodeName); err != nil {
logger.Error(err, "failed to ensure build sandbox and task", "node", nodeName)
continue
}
// Register an artifact entry in Creating state.
artifactSet.Artifacts = append(artifactSet.Artifacts, store.SnapshotArtifact{
ProviderName: sc.Spec.ProviderName,
NodeName: nodeName,
Phase: store.SnapshotArtifactPhaseCreating,
SnapshotKey: artifactSet.SnapshotKey,
SnapshotHash: artifactSet.SnapshotHash,
})
manifest.ArtifactSets[workingKey] = artifactSet
rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion)
if err != nil {
return rawVersion, err
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In reconcileTasksAndArtifacts for Fork mode, the controller saves the manifest to the artifact store (Redis) inside the loop for every single target node. If there are multiple nodes, this causes write amplification and unnecessary transaction conflicts.

To optimize this, batch the new artifact creations and save the manifest once after the loop.

		for _, nodeName := range targetNodes {
			if _, covered := coveredNodes[nodeName]; covered {
				continue
			}
			// Create build Sandbox and task for this node.
			if err := r.ensureBuildSandboxAndTask(ctx, ss, sc, artifactSet.SnapshotKey, artifactSet.SnapshotHash, nodeName); err != nil {
				logger.Error(err, "failed to ensure build sandbox and task", "node", nodeName)
				continue
			}
			// Register an artifact entry in Creating state.
			artifactSet.Artifacts = append(artifactSet.Artifacts, store.SnapshotArtifact{
				ProviderName: sc.Spec.ProviderName,
				NodeName:     nodeName,
				Phase:        store.SnapshotArtifactPhaseCreating,
				SnapshotKey:  artifactSet.SnapshotKey,
				SnapshotHash: artifactSet.SnapshotHash,
			})
		}
		manifest.ArtifactSets[workingKey] = artifactSet
		rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion)
		if err != nil {
			return rawVersion, err
		}

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Introduces a snapshotting subsystem (CRDs + controllers + storage) and wires it into Workload Manager session creation to enable snapshot-based “warm” restores when a suitable snapshot is available.

Changes:

  • Adds runtime CRDs/types for SnapshotClass, SandboxSnapshot, and SandboxSnapshotTask, plus controller implementations in workload-manager and agentd.
  • Implements an artifact manifest store (Redis/Valkey-backed) and lookup logic to inject snapshot restore intent into created Sandboxes.
  • Updates CodeInterpreter reconciliation to always maintain a SandboxTemplate as a Fork snapshot source.

Reviewed changes

Copilot reviewed 18 out of 37 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/workloadmanager/snapshot_restore.go Adds lookup logic to find an active snapshot key for restore intent.
pkg/workloadmanager/snapshot_controller.go Adds controller for managing snapshots, tasks, artifact manifests, and status aggregation.
pkg/workloadmanager/server.go Extends server to accept optional snapshot client + artifact store.
pkg/workloadmanager/handlers.go Injects snapshot restore annotation on sandbox creation when applicable.
pkg/workloadmanager/codeinterpreter_controller.go Always reconciles SandboxTemplate; stops deleting it when WarmPool disabled.
pkg/workloadmanager/artifact_store_init.go Adds env-based ArtifactStore initialization (Redis/Valkey + noop fallback).
pkg/store/artifact_store.go Introduces ArtifactStore interface and manifest data model.
pkg/store/artifact_store_redis.go Adds Redis-backed ArtifactStore with optimistic CAS via WATCH/MULTI.
pkg/apis/runtime/v1alpha1/snapshot_types.go Adds API types/constants and scheme registration for snapshot CRDs.
pkg/agentd/snapshot_task_reconciler.go Adds node-local controller to execute snapshot tasks via drivers.
pkg/agentd/snapshot_driver.go Defines SnapshotDriver interface and request/response model.
pkg/agentd/kuasar_driver.go Adds Kuasar driver skeleton for snapshot creation via admin socket.
manifests/charts/base/crds/*.yaml Adds CRDs for snapshot resources.
cmd/workload-manager/main.go Wires snapshot reconciler + artifact store into manager and server.
cmd/agentd/main.go Registers snapshot API scheme and starts SnapshotTask controller.
go.mod Minor dependency ordering adjustment.
Files not reviewed (19)
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshot.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshottask.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_snapshotclass.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/snapshotclass.go: Language not supported
  • client-go/informers/externalversions/generic.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/interface.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/snapshotclass.go: Language not supported
  • client-go/listers/runtime/v1alpha1/expansion_generated.go: Language not supported
  • client-go/listers/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
  • client-go/listers/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
  • client-go/listers/runtime/v1alpha1/snapshotclass.go: Language not supported
  • pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go: Language not supported
Comments suppressed due to low confidence (4)

pkg/workloadmanager/snapshot_controller.go:1

  • In ensureBuildSandboxAndTask, returning early when the task already exists can leave the system stuck if the build Sandbox was deleted (or never created) while the task remains. The agent reconciler will then requeue forever waiting for the target Sandbox to become Running, and the controller will never recreate it because it exits early. Suggested fix: even when the task exists, still ensure the build Sandbox exists (or recreate it if missing), or only early-return when the task is already in a terminal phase.
    pkg/workloadmanager/snapshot_restore.go:1
  • The lookup returns the first Ready Fork snapshot encountered, but List order is not guaranteed. If multiple snapshots exist for the same SandboxTemplate, this can pick an older/incorrect snapshot arbitrarily. Suggested fix: select deterministically (e.g., prefer status.readyAt newest, else metadata.creationTimestamp newest) before returning the snapshot key.
    pkg/workloadmanager/snapshot_controller.go:1
  • Fork promotion requires all node artifacts to be Ready before ActiveSetRef is set. However, status/events and restore lookup semantics elsewhere imply "at least one artifact available" should be sufficient for using snapshot restore intent. As-is, snapshot restore won’t be used until full fleet coverage completes. Suggested fix: either (a) promote to active once any artifact is Ready (and continue building remaining nodes in the background), or (b) allow restore lookup to use the pending set when it has at least one Ready artifact, and update status/event messaging accordingly so semantics are consistent.
    pkg/workloadmanager/snapshot_controller.go:1
  • Hash normalization sorts tolerations only by Key using sort.Slice (unstable). If multiple tolerations share the same key, their relative order can change, leading to hash churn and unnecessary rebuilds. Suggested fix: use sort.SliceStable and/or add tie-breakers over the full toleration tuple (e.g., Key, Operator, Value, Effect, TolerationSeconds) to guarantee deterministic ordering.

Comment thread pkg/agentd/kuasar_driver.go Outdated
Comment on lines +250 to +252
func (d *KuasarDriver) readResponse(conn net.Conn) (*kuasarResponse, error) {
// TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised.
line, err := bufio.NewReader(conn).ReadString('\n')
Comment on lines +110 to +113
snapshotReconciler := &workloadmanager.SandboxSnapshotReconciler{
Client: mgr.GetClient(),
ArtifactStore: workloadmanager.NewArtifactStoreFromEnv(),
}
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 5, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 1.86005% with 1108 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.32%. Comparing base (524e55e) to head (bcd8ad5).
⚠️ Report is 119 commits behind head on main.

Files with missing lines Patch % Lines
pkg/workloadmanager/snapshot_controller.go 0.00% 287 Missing ⚠️
pkg/workloadmanager/snapshot_fork.go 0.00% 265 Missing ⚠️
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go 0.00% 187 Missing ⚠️
pkg/agentd/kuasar_driver.go 0.78% 125 Missing and 1 partial ⚠️
pkg/agentd/snapshot_task_reconciler.go 0.00% 86 Missing ⚠️
pkg/store/artifact_store_redis.go 0.00% 40 Missing ⚠️
pkg/workloadmanager/artifact_store_init.go 0.00% 30 Missing ⚠️
pkg/agentd/node_capability.go 0.00% 27 Missing ⚠️
pkg/agentd/driver_registry.go 8.33% 22 Missing ⚠️
pkg/agentd/snapshot_mode.go 0.00% 17 Missing ⚠️
... and 4 more
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #379      +/-   ##
==========================================
- Coverage   47.57%   43.32%   -4.25%     
==========================================
  Files          30       45      +15     
  Lines        2819     4279    +1460     
==========================================
+ Hits         1341     1854     +513     
- Misses       1338     2239     +901     
- Partials      140      186      +46     
Flag Coverage Δ
unittests 43.32% <1.86%> (-4.25%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: lyuyun <lyuyun068@gmail.com>
Copilot AI review requested due to automatic review settings June 5, 2026 11:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 42 changed files in this pull request and generated 1 comment.

Files not reviewed (18)
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshot.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshottask.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_snapshotclass.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
  • client-go/clientset/versioned/typed/runtime/v1alpha1/snapshotclass.go: Language not supported
  • client-go/informers/externalversions/generic.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/interface.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
  • client-go/informers/externalversions/runtime/v1alpha1/snapshotclass.go: Language not supported
  • client-go/listers/runtime/v1alpha1/expansion_generated.go: Language not supported
  • client-go/listers/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
  • client-go/listers/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
  • client-go/listers/runtime/v1alpha1/snapshotclass.go: Language not supported
Comments suppressed due to low confidence (8)

pkg/workloadmanager/snapshot_fork.go:1

  • The sort.Slice comparator is not a strict ordering (it only compares Key). If two tolerations share the same Key, the comparator returns false for both directions, and Go’s (unstable) sort may reorder elements nondeterministically, causing snapshot hashes to flap. Use sort.SliceStable and/or add tie-breakers (e.g., compare Operator, Value, Effect, TolerationSeconds) so the order is deterministic.
    pkg/workloadmanager/snapshot_controller.go:1
  • buildSnapshotKey can easily exceed 63 characters, but snapshotKey is used as a label value (e.g., SnapshotKeyLabelKey) in the Fork handler and controller list filters. Kubernetes label values must be ≤63 chars, otherwise resource creation/list matching will fail. Consider generating a separate label-safe identifier (e.g., sha256(snapshotKey) truncated) for labels and keep the full snapshotKey in spec/annotations, or enforce a bounded snapshotKey format that preserves uniqueness within 63 chars.
    pkg/workloadmanager/snapshot_controller.go:1
  • Unavailable is a defined phase for snapshot artifacts/tasks, but cleanup currently only triggers for Ready and Failed. If a task reaches Unavailable, it will never be deleted and mode-specific cleanup won’t run, which can leak build sandboxes and tasks. Include SnapshotArtifactPhaseUnavailable as a terminal phase here (and in any other terminal-phase checks).
    pkg/agentd/snapshot_task_reconciler.go:1
  • SnapshotArtifactPhaseUnavailable is a defined terminal-ish state but isn’t treated as terminal here. If the node agent (now or later) sets Unavailable, reconcile will continue and may keep attempting driver operations unnecessarily. Treat Unavailable as terminal (or add explicit handling) to match the controller’s phase model.
    pkg/workloadmanager/snapshot_fork.go:1
  • lookupActiveForkSnapshotKey lists all SandboxSnapshot objects in the namespace and scans them on every sandbox creation request. Even with a cached controller-runtime client, this is O(N) per request and can become a hotspot. A concrete improvement is to label SandboxSnapshot with sourceRef.name / mode and list using client.MatchingLabels, or add a cache index on .spec.sourceRef.name and use MatchingFields to avoid scanning unrelated snapshots.
    pkg/workloadmanager/snapshot_controller.go:1
  • This map access doesn’t check existence. If PendingSetRef.SnapshotKey is set but the map entry is missing (store corruption, manual edits, or partial writes), pending becomes a zero-value set which can mask the problem and potentially be promoted depending on handler logic. Use pending, ok := ...; if !ok { ... } and clear PendingSetRef (and persist) or surface a clear error so the controller doesn’t proceed with invalid data.
    pkg/store/artifact_store_redis.go:1
  • The comment about the “version token” doesn’t match the implementation: callers pass the previous raw manifest JSON string (from GetManifest/loadManifest) and PutManifest compares it directly to the current stored string; it’s not a “JSON encoding of the raw string value”. Please update the comment to describe the actual token semantics (raw stored value snapshot) to avoid confusing future maintainers.
    pkg/workloadmanager/snapshot_controller.go:1
  • json.Marshal errors are ignored here. While it “shouldn’t happen” for this struct, returning the marshal error is safer and avoids returning an empty/incorrect version token which could break compare-and-set behavior on subsequent writes.

Comment thread cmd/agentd/main.go
Comment on lines +89 to +92
if err := agentd.AdvertiseDriverCapabilities(ctrl.SetupSignalHandler(), cs, nodeName, registry.Drivers()); err != nil {
fmt.Fprintf(os.Stderr, "unable to advertise driver capabilities: %v\n", err)
os.Exit(1)
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants