feat: implement snapstart for codeinterpreter #379
Code Review
This pull request introduces comprehensive sandbox snapshotting capabilities, including new CRDs (SandboxSnapshot, SandboxSnapshotTask, SnapshotClass), a snapshot controller, a node-agent reconciler, and a Redis/Valkey-backed artifact store. The reviewer identified several critical issues: an infinite rebuilding loop in the snapshot controller due to un-reset readiness timestamps, potential resource flooding when using the fallback no-op artifact store, protocol desynchronization in the Kuasar driver from recreating buffered readers, missing connection deadlines on the admin socket, and write amplification from saving manifests inside a loop. Addressing these bugs and optimization opportunities is highly recommended before merging.
| manifest.ActiveSetRef = manifest.PendingSetRef | ||
| manifest.PendingSetRef = store.SnapshotArtifactSetRef{} | ||
| rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion) |
When a background rebuild completes and the pending set is promoted to active, ss.Status.ReadyAt is not reset or updated. In subsequent reconciliations, time.Since(ss.Status.ReadyAt.Time) will still be greater than RebuildAfter.Duration, causing the controller to immediately trigger another background rebuild. This results in an infinite loop of continuous rebuilding and promotion, wasting massive CPU and I/O resources.
To fix this, reset ss.Status.ReadyAt to nil upon promotion so that aggregateAndUpdateStatus updates it to the current time.
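The loop described above can be reproduced in isolation. The following is a minimal sketch, not the controller's code: `status`, `needsRebuild`, `promote`, and `countRebuilds` are hypothetical stand-ins, and the re-stamping inside `promote` only approximates what the review says `aggregateAndUpdateStatus` does when `ReadyAt` is nil.

```go
package main

import (
	"fmt"
	"time"
)

// status is a hypothetical stand-in for the snapshot status; only ReadyAt matters here.
type status struct {
	ReadyAt *time.Time
}

// needsRebuild mirrors the check described in the review: rebuild once the
// snapshot has been ready for longer than rebuildAfter.
func needsRebuild(s status, now time.Time, rebuildAfter time.Duration) bool {
	return s.ReadyAt != nil && now.Sub(*s.ReadyAt) > rebuildAfter
}

// promote simulates promoting the pending set. With resetReadyAt it clears
// ReadyAt first; a nil ReadyAt is then re-stamped with the current time
// (assumed behaviour of the status aggregation step).
func promote(s status, now time.Time, resetReadyAt bool) status {
	if resetReadyAt {
		s.ReadyAt = nil
	}
	if s.ReadyAt == nil {
		t := now
		s.ReadyAt = &t
	}
	return s
}

// countRebuilds runs `cycles` reconcile passes one minute apart and counts
// how many of them trigger a rebuild.
func countRebuilds(resetReadyAt bool, cycles int) int {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	s := status{ReadyAt: &start}
	rebuildAfter := time.Hour
	now := start.Add(2 * time.Hour)
	rebuilds := 0
	for i := 0; i < cycles; i++ {
		if needsRebuild(s, now, rebuildAfter) {
			rebuilds++
			s = promote(s, now, resetReadyAt)
		}
		now = now.Add(time.Minute)
	}
	return rebuilds
}

func main() {
	fmt.Println("rebuilds without reset:", countRebuilds(false, 5))
	fmt.Println("rebuilds with reset:", countRebuilds(true, 5))
}
```

Without the reset, every pass in the simulation triggers another rebuild; with it, only the first pass does.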
manifest.ActiveSetRef = manifest.PendingSetRef
manifest.PendingSetRef = store.SnapshotArtifactSetRef{}
ss.Status.ReadyAt = nil
rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion)
| func (n *noopArtifactStore) GetManifest(_ context.Context, _ string) (*store.SnapshotArtifactManifest, error) { | ||
| return nil, nil | ||
| } | ||
| func (n *noopArtifactStore) PutManifest(_ context.Context, _ string, _ *store.SnapshotArtifactManifest, _ string) error { | ||
| return nil | ||
| } |
When no Redis/Valkey backend is configured, NewArtifactStoreFromEnv returns a noopArtifactStore. Since GetManifest silently returns nil, nil and PutManifest returns nil, the snapshot controller will think no active or pending set exists on every reconciliation cycle. It will continuously generate new snapshot keys and create new build Sandboxes and tasks, flooding the Kubernetes cluster with infinite orphaned resources.
To prevent this, make noopArtifactStore return an error from GetManifest and PutManifest so the controller safely fails and backs off, while the session creation path still gracefully falls back to cold start. Note that you will need to import "fmt" in this file.
| func (n *noopArtifactStore) GetManifest(_ context.Context, _ string) (*store.SnapshotArtifactManifest, error) { | |
| return nil, fmt.Errorf("artifact store is not configured") | |
| } | |
| func (n *noopArtifactStore) PutManifest(_ context.Context, _ string, _ *store.SnapshotArtifactManifest, _ string) error { | |
| return fmt.Errorf("artifact store is not configured") | |
| } |
| func (d *KuasarDriver) readResponse(conn net.Conn) (*kuasarResponse, error) { | ||
| // TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised. | ||
| line, err := bufio.NewReader(conn).ReadString('\n') | ||
| if err != nil { | ||
| return nil, fmt.Errorf("read response: %w", err) | ||
| } | ||
| resp := &kuasarResponse{} | ||
| if err := json.Unmarshal([]byte(line), resp); err != nil { | ||
| return nil, fmt.Errorf("unmarshal response: %w", err) | ||
| } | ||
| if resp.Error != "" { | ||
| return nil, fmt.Errorf("kuasar error: %s", resp.Error) | ||
| } | ||
| return resp, nil | ||
| } |
Creating a new bufio.Reader on every call to readResponse is a classic Go networking bug. Since bufio.Reader buffers data from the underlying connection, any bytes read into the buffer beyond the first delimiter (\n) will be lost when the next call creates a new bufio.Reader. This will cause protocol desynchronization and connection hangs.
To fix this, instantiate a single bufio.Reader once per connection (e.g., in performHandshake) and pass it to readResponse and other helpers.
| func (d *KuasarDriver) readResponse(rd *bufio.Reader) (*kuasarResponse, error) { | |
| // TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised. | |
| line, err := rd.ReadString('\n') | |
| if err != nil { | |
| return nil, fmt.Errorf("read response: %w", err) | |
| } | |
| resp := &kuasarResponse{} | |
| if err := json.Unmarshal([]byte(line), resp); err != nil { | |
| return nil, fmt.Errorf("unmarshal response: %w", err) | |
| } | |
| if resp.Error != "" { | |
| return nil, fmt.Errorf("kuasar error: %s", resp.Error) | |
| } | |
| return resp, nil | |
| } |
| func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) { | ||
| dialer := &net.Dialer{} | ||
| conn, err := dialer.DialContext(ctx, "unix", d.SocketPath) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err) | ||
| } | ||
| return conn, nil | ||
| } |
The driver connects to the Kuasar admin socket but never sets read or write deadlines. If the Kuasar daemon hangs or stops responding, the node agent's reconciler goroutine will block indefinitely on conn.Read or conn.Write, leading to worker starvation and resource leaks.
To prevent this, set a reasonable deadline on the connection right after dialing.
| func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) { | |
| dialer := &net.Dialer{} | |
| conn, err := dialer.DialContext(ctx, "unix", d.SocketPath) | |
| if err != nil { | |
| return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err) | |
| } | |
| _ = conn.SetDeadline(time.Now().Add(10 * time.Minute)) // Prevent indefinite hangs | |
| return conn, nil | |
| } |
| for _, nodeName := range targetNodes { | ||
| if _, covered := coveredNodes[nodeName]; covered { | ||
| continue | ||
| } | ||
| // Create build Sandbox and task for this node. | ||
| if err := r.ensureBuildSandboxAndTask(ctx, ss, sc, artifactSet.SnapshotKey, artifactSet.SnapshotHash, nodeName); err != nil { | ||
| logger.Error(err, "failed to ensure build sandbox and task", "node", nodeName) | ||
| continue | ||
| } | ||
| // Register an artifact entry in Creating state. | ||
| artifactSet.Artifacts = append(artifactSet.Artifacts, store.SnapshotArtifact{ | ||
| ProviderName: sc.Spec.ProviderName, | ||
| NodeName: nodeName, | ||
| Phase: store.SnapshotArtifactPhaseCreating, | ||
| SnapshotKey: artifactSet.SnapshotKey, | ||
| SnapshotHash: artifactSet.SnapshotHash, | ||
| }) | ||
| manifest.ArtifactSets[workingKey] = artifactSet | ||
| rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion) | ||
| if err != nil { | ||
| return rawVersion, err | ||
| } | ||
| } |
In reconcileTasksAndArtifacts for Fork mode, the controller saves the manifest to the artifact store (Redis) inside the loop for every single target node. If there are multiple nodes, this causes write amplification and unnecessary transaction conflicts.
To optimize this, batch the new artifact creations and save the manifest once after the loop.
for _, nodeName := range targetNodes {
if _, covered := coveredNodes[nodeName]; covered {
continue
}
// Create build Sandbox and task for this node.
if err := r.ensureBuildSandboxAndTask(ctx, ss, sc, artifactSet.SnapshotKey, artifactSet.SnapshotHash, nodeName); err != nil {
logger.Error(err, "failed to ensure build sandbox and task", "node", nodeName)
continue
}
// Register an artifact entry in Creating state.
artifactSet.Artifacts = append(artifactSet.Artifacts, store.SnapshotArtifact{
ProviderName: sc.Spec.ProviderName,
NodeName: nodeName,
Phase: store.SnapshotArtifactPhaseCreating,
SnapshotKey: artifactSet.SnapshotKey,
SnapshotHash: artifactSet.SnapshotHash,
})
}
manifest.ArtifactSets[workingKey] = artifactSet
rawVersion, err = r.saveManifest(ctx, ownerKey, manifest, rawVersion)
if err != nil {
return rawVersion, err
}
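The difference in write volume can be shown with a counting stub. This is an illustrative sketch only: `countingStore`, `registerPerNode`, and `registerBatched` are hypothetical, and node names stand in for artifact entries.

```go
package main

import "fmt"

// countingStore is a hypothetical stub that only counts manifest writes.
type countingStore struct{ writes int }

func (s *countingStore) save(artifacts []string) { s.writes++ }

// registerPerNode mirrors the original loop: one manifest write per node.
func registerPerNode(s *countingStore, nodes []string) []string {
	var artifacts []string
	for _, n := range nodes {
		artifacts = append(artifacts, n)
		s.save(artifacts)
	}
	return artifacts
}

// registerBatched collects all new entries first and writes once, skipping
// the write entirely when nothing was added.
func registerBatched(s *countingStore, nodes []string) []string {
	var artifacts []string
	for _, n := range nodes {
		artifacts = append(artifacts, n)
	}
	if len(artifacts) > 0 {
		s.save(artifacts)
	}
	return artifacts
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	perNode, batched := &countingStore{}, &countingStore{}
	registerPerNode(perNode, nodes)
	registerBatched(batched, nodes)
	fmt.Println("per-node writes:", perNode.writes, "batched writes:", batched.writes)
}
```

One trade-off to weigh: with a single write after the loop, a failed save leaves none of the newly created build resources recorded, so the create step needs to stay idempotent.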
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Introduces a snapshotting subsystem (CRDs + controllers + storage) and wires it into Workload Manager session creation to enable snapshot-based “warm” restores when a suitable snapshot is available.
Changes:
- Adds runtime CRDs/types for SnapshotClass, SandboxSnapshot, and SandboxSnapshotTask, plus controller implementations in workload-manager and agentd.
- Implements an artifact manifest store (Redis/Valkey-backed) and lookup logic to inject snapshot restore intent into created Sandboxes.
- Updates CodeInterpreter reconciliation to always maintain a SandboxTemplate as a Fork snapshot source.
Reviewed changes
Copilot reviewed 18 out of 37 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/workloadmanager/snapshot_restore.go | Adds lookup logic to find an active snapshot key for restore intent. |
| pkg/workloadmanager/snapshot_controller.go | Adds controller for managing snapshots, tasks, artifact manifests, and status aggregation. |
| pkg/workloadmanager/server.go | Extends server to accept optional snapshot client + artifact store. |
| pkg/workloadmanager/handlers.go | Injects snapshot restore annotation on sandbox creation when applicable. |
| pkg/workloadmanager/codeinterpreter_controller.go | Always reconciles SandboxTemplate; stops deleting it when WarmPool disabled. |
| pkg/workloadmanager/artifact_store_init.go | Adds env-based ArtifactStore initialization (Redis/Valkey + noop fallback). |
| pkg/store/artifact_store.go | Introduces ArtifactStore interface and manifest data model. |
| pkg/store/artifact_store_redis.go | Adds Redis-backed ArtifactStore with optimistic CAS via WATCH/MULTI. |
| pkg/apis/runtime/v1alpha1/snapshot_types.go | Adds API types/constants and scheme registration for snapshot CRDs. |
| pkg/agentd/snapshot_task_reconciler.go | Adds node-local controller to execute snapshot tasks via drivers. |
| pkg/agentd/snapshot_driver.go | Defines SnapshotDriver interface and request/response model. |
| pkg/agentd/kuasar_driver.go | Adds Kuasar driver skeleton for snapshot creation via admin socket. |
| manifests/charts/base/crds/*.yaml | Adds CRDs for snapshot resources. |
| cmd/workload-manager/main.go | Wires snapshot reconciler + artifact store into manager and server. |
| cmd/agentd/main.go | Registers snapshot API scheme and starts SnapshotTask controller. |
| go.mod | Minor dependency ordering adjustment. |
Files not reviewed (19)
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshot.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshottask.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_snapshotclass.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/snapshotclass.go: Language not supported
- client-go/informers/externalversions/generic.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/interface.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/snapshotclass.go: Language not supported
- client-go/listers/runtime/v1alpha1/expansion_generated.go: Language not supported
- client-go/listers/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
- client-go/listers/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
- client-go/listers/runtime/v1alpha1/snapshotclass.go: Language not supported
- pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go: Language not supported
Comments suppressed due to low confidence (4)
pkg/workloadmanager/snapshot_controller.go:1
- In ensureBuildSandboxAndTask, returning early when the task already exists can leave the system stuck if the build Sandbox was deleted (or never created) while the task remains. The agent reconciler will then requeue forever waiting for the target Sandbox to become Running, and the controller will never recreate it because it exits early. Suggested fix: even when the task exists, still ensure the build Sandbox exists (or recreate it if missing), or only early-return when the task is already in a terminal phase.
pkg/workloadmanager/snapshot_restore.go:1
- The lookup returns the first Ready Fork snapshot encountered, but List order is not guaranteed. If multiple snapshots exist for the same SandboxTemplate, this can pick an older/incorrect snapshot arbitrarily. Suggested fix: select deterministically (e.g., prefer status.readyAt newest, else metadata.creationTimestamp newest) before returning the snapshot key.
pkg/workloadmanager/snapshot_controller.go:1
- Fork promotion requires all node artifacts to be Ready before ActiveSetRef is set. However, status/events and restore lookup semantics elsewhere imply "at least one artifact available" should be sufficient for using snapshot restore intent. As-is, snapshot restore won’t be used until full fleet coverage completes. Suggested fix: either (a) promote to active once any artifact is Ready (and continue building remaining nodes in the background), or (b) allow restore lookup to use the pending set when it has at least one Ready artifact, and update status/event messaging accordingly so semantics are consistent.
pkg/workloadmanager/snapshot_controller.go:1
- Hash normalization sorts tolerations only by Key using sort.Slice (unstable). If multiple tolerations share the same key, their relative order can change, leading to hash churn and unnecessary rebuilds. Suggested fix: use sort.SliceStable and/or add tie-breakers over the full toleration tuple (e.g., Key, Operator, Value, Effect, TolerationSeconds) to guarantee deterministic ordering.
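The tie-breaker suggestion for toleration ordering might look like the following. It is a sketch under assumptions: the local `toleration` struct only mirrors the fields named in the comment (the real type lives in the Kubernetes API packages), and `fingerprint` is a hypothetical stand-in for whatever feeds the snapshot hash.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toleration mirrors the fields named in the review comment.
type toleration struct {
	Key, Operator, Value, Effect string
	TolerationSeconds            *int64
}

// lessSeconds orders nil before any set value, then by value.
func lessSeconds(a, b *int64) bool {
	switch {
	case a == nil:
		return b != nil
	case b == nil:
		return false
	default:
		return *a < *b
	}
}

// less compares the full tuple so that equal keys never leave the order
// to the sort implementation.
func less(a, b toleration) bool {
	if a.Key != b.Key {
		return a.Key < b.Key
	}
	if a.Operator != b.Operator {
		return a.Operator < b.Operator
	}
	if a.Value != b.Value {
		return a.Value < b.Value
	}
	if a.Effect != b.Effect {
		return a.Effect < b.Effect
	}
	return lessSeconds(a.TolerationSeconds, b.TolerationSeconds)
}

// normalize returns a sorted copy and leaves the input untouched.
func normalize(in []toleration) []toleration {
	out := append([]toleration(nil), in...)
	sort.SliceStable(out, func(i, j int) bool { return less(out[i], out[j]) })
	return out
}

// fingerprint renders the ordered list as a string (hash input stand-in).
func fingerprint(ts []toleration) string {
	parts := make([]string, 0, len(ts))
	for _, t := range ts {
		secs := "nil"
		if t.TolerationSeconds != nil {
			secs = fmt.Sprint(*t.TolerationSeconds)
		}
		parts = append(parts, strings.Join([]string{t.Key, t.Operator, t.Value, t.Effect, secs}, "/"))
	}
	return strings.Join(parts, ";")
}

func main() {
	a := toleration{Key: "gpu", Operator: "Equal", Value: "a100", Effect: "NoSchedule"}
	b := toleration{Key: "gpu", Operator: "Equal", Value: "h100", Effect: "NoSchedule"}
	same := fingerprint(normalize([]toleration{a, b})) == fingerprint(normalize([]toleration{b, a}))
	fmt.Println("order-independent:", same)
}
```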
| func (d *KuasarDriver) readResponse(conn net.Conn) (*kuasarResponse, error) { | ||
| // TODO(maintainer): replace with actual Kuasar wire protocol framing once stabilised. | ||
| line, err := bufio.NewReader(conn).ReadString('\n') |
| snapshotReconciler := &workloadmanager.SandboxSnapshotReconciler{ | ||
| Client: mgr.GetClient(), | ||
| ArtifactStore: workloadmanager.NewArtifactStoreFromEnv(), | ||
| } |
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #379      +/-   ##
==========================================
- Coverage   47.57%   43.32%   -4.25%
==========================================
  Files          30       45      +15
  Lines        2819     4279    +1460
==========================================
+ Hits         1341     1854     +513
- Misses       1338     2239     +901
- Partials      140      186      +46
Signed-off-by: lyuyun <lyuyun068@gmail.com>
Pull request overview
Copilot reviewed 24 out of 42 changed files in this pull request and generated 1 comment.
Files not reviewed (18)
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshot.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshottask.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_snapshotclass.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/snapshotclass.go: Language not supported
- client-go/informers/externalversions/generic.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/interface.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/snapshotclass.go: Language not supported
- client-go/listers/runtime/v1alpha1/expansion_generated.go: Language not supported
- client-go/listers/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
- client-go/listers/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
- client-go/listers/runtime/v1alpha1/snapshotclass.go: Language not supported
Comments suppressed due to low confidence (8)
pkg/workloadmanager/snapshot_fork.go:1
- The sort.Slice comparator does not define a total order (it only compares Key). If two tolerations share the same Key, the comparator returns false for both directions, and Go’s (unstable) sort may reorder elements nondeterministically, causing snapshot hashes to flap. Use sort.SliceStable and/or add tie-breakers (e.g., compare Operator, Value, Effect, TolerationSeconds) so the order is deterministic.
pkg/workloadmanager/snapshot_controller.go:1
- buildSnapshotKey can easily exceed 63 characters, but snapshotKey is used as a label value (e.g., SnapshotKeyLabelKey) in the Fork handler and controller list filters. Kubernetes label values must be ≤63 chars, otherwise resource creation/list matching will fail. Consider generating a separate label-safe identifier (e.g., sha256(snapshotKey) truncated) for labels and keep the full snapshotKey in spec/annotations, or enforce a bounded snapshotKey format that preserves uniqueness within 63 chars.
pkg/workloadmanager/snapshot_controller.go:1
- Unavailable is a defined phase for snapshot artifacts/tasks, but cleanup currently only triggers for Ready and Failed. If a task reaches Unavailable, it will never be deleted and mode-specific cleanup won’t run, which can leak build sandboxes and tasks. Include SnapshotArtifactPhaseUnavailable as a terminal phase here (and in any other terminal-phase checks).
pkg/agentd/snapshot_task_reconciler.go:1
- SnapshotArtifactPhaseUnavailable is a defined terminal-ish state but isn’t treated as terminal here. If the node agent (now or later) sets Unavailable, reconcile will continue and may keep attempting driver operations unnecessarily. Treat Unavailable as terminal (or add explicit handling) to match the controller’s phase model.
pkg/workloadmanager/snapshot_fork.go:1
- lookupActiveForkSnapshotKey lists all SandboxSnapshot objects in the namespace and scans them on every sandbox creation request. Even with a cached controller-runtime client, this is O(N) per request and can become a hotspot. A concrete improvement is to label SandboxSnapshot with sourceRef.name / mode and list using client.MatchingLabels, or add a cache index on .spec.sourceRef.name and use MatchingFields to avoid scanning unrelated snapshots.
pkg/workloadmanager/snapshot_controller.go:1
- This map access doesn’t check existence. If PendingSetRef.SnapshotKey is set but the map entry is missing (store corruption, manual edits, or partial writes), pending becomes a zero-value set which can mask the problem and potentially be promoted depending on handler logic. Use pending, ok := ...; if !ok { ... } and clear PendingSetRef (and persist) or surface a clear error so the controller doesn’t proceed with invalid data.
pkg/store/artifact_store_redis.go:1
- The comment about the “version token” doesn’t match the implementation: callers pass the previous raw manifest JSON string (from GetManifest/loadManifest) and PutManifest compares it directly to the current stored string; it’s not a “JSON encoding of the raw string value”. Please update the comment to describe the actual token semantics (raw stored value snapshot) to avoid confusing future maintainers.
pkg/workloadmanager/snapshot_controller.go:1
- json.Marshal errors are ignored here. While it “shouldn’t happen” for this struct, returning the marshal error is safer and avoids returning an empty/incorrect version token which could break compare-and-set behavior on subsequent writes.
| if err := agentd.AdvertiseDriverCapabilities(ctrl.SetupSignalHandler(), cs, nodeName, registry.Drivers()); err != nil { | ||
| fmt.Fprintf(os.Stderr, "unable to advertise driver capabilities: %v\n", err) | ||
| os.Exit(1) | ||
| } |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR implements the AgentCube SnapStart feature for CodeInterpreter.
SnapStart provides a snapshot-based startup acceleration path for CodeInterpreter sessions. It builds reusable runtime snapshot artifacts and lets new sandboxes restore from an active snapshotKey instead of always cold-starting from the image.
This PR implements the control-plane and node-agent pieces for SnapStart:
- SandboxSnapshot: the generic snapshot CRD, supporting snapshotMode=Fork and snapshotMode=Resume.
- SnapshotClass: infrastructure/provider capability selection using providerName, supportedSnapshotModes, and node selection.
- SandboxSnapshotTask: internal node-facing task CRD used by the snapshot controller to dispatch snapshot creation to node agents.
- SandboxSnapshotController: workload-manager controller that manages snapshot lifecycle, target node selection, artifact task creation, retry handling, active/pending artifact promotion, and cleanup.
- SnapshotDriver: node-agent-local driver abstraction for provider-specific snapshot creation.
- agentd.
- SnapshotArtifactManifest storage for active and pending snapshot artifact sets.
- Sandbox creation.
The implementation keeps snapshot orchestration generic. Runtime-specific controllers can express snapshot intent through SandboxSnapshot and SandboxTemplate, while the snapshot controller manages snapshot tasks, artifacts, status, retries, and restore availability.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This PR follows the SnapStart proposal from #366 and implements the Phase 1 CodeInterpreter SnapStart path.
Key review areas:
- SandboxSnapshot / SandboxSnapshotTask API shape matches the proposal.
- SandboxSnapshotController is sufficient.
Verification:
- go test -run '^$' ./... passed.
- go test ./... was not runnable in the current sandbox because tests require local socket access, Redis, and kubeconfig.
Does this PR introduce a user-facing change?: