feat(picod): implement two-stage secure initialization to isolate sandbox sessions#352
feat(picod): implement two-stage secure initialization to isolate sandbox sessions#352Abhinav-kodes wants to merge 1 commit into
Conversation
|
@Abhinav-kodes: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
21ecf85 to
4e28b8b
Compare
There was a problem hiding this comment.
Code Review
This pull request implements a two-stage authentication mechanism for PicoD sandboxes by introducing an /init endpoint and bootstrap key generation. The reviewer identified a critical security flaw where the router's shared key is incorrectly reused for session initialization instead of a unique session key. Additionally, the review points out that the session key can be overwritten, suggests refactoring duplicated PEM parsing logic, highlights the need for comprehensive unit tests for the new initialization flow, and recommends making the HTTP client timeout configurable.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #352 +/- ##
==========================================
+ Coverage 47.57% 57.32% +9.75%
==========================================
Files 30 36 +6
Lines 2819 3677 +858
==========================================
+ Hits 1341 2108 +767
+ Misses 1338 1325 -13
- Partials 140 244 +104
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
1c65fc7 to
2ef218c
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements a two-stage authentication and initialization workflow for PicoD. It introduces bootstrap keys to verify an initial setup phase that establishes unique session-specific keys for each sandbox. The workload manager has been updated to generate these ephemeral keys and perform remote initialization of PicoD instances, while the router now supports signing JWTs with per-session private keys. Reviewer feedback identifies performance bottlenecks related to frequent RSA key parsing and redundant HTTP client instantiation, and provides corrections for the golang-jwt/jwt/v5 library usage to ensure proper claim enforcement.
d610e39 to
1672739
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements a two-stage authentication and initialization flow for PicoD, introducing a bootstrap public key and an /init endpoint for session-specific key registration. The changes include updates to the Workload Manager to generate and distribute these keys, and modifications to the Router to support session-specific JWT signing. Feedback includes concerns regarding a potential memory leak in the JWTManager due to an unbounded cache, the use of global variables in bootstrap_auth.go which hinders testability and thread safety, and a recommendation to improve security by verifying the issuer claim in the bootstrap JWT.
|
/gemini review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a two-stage secure initialization trust chain between the Router, Workload Manager, and PicoD. It replaces the static RSA key pair model with a bootstrap RSA key pair used to verify an /init handshake, which dynamically establishes a unique, per-session ECDSA (P-256) key pair. The session private key is encrypted via AES-GCM and stored in Redis/Valkey. Feedback on the changes points out a compilation error in pkg/picod/auth.go due to the use of an invalid JWT parser option (jwt.WithIssuedAt), and suggests improving responsiveness to context cancellation in pkg/router/jwt.go by replacing time.Sleep with a select block.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements a two-stage secure initialization trust chain for PicoD, transitioning from a static RSA key pair to a dynamic ECDSA session key pair established via an /init handshake. It also introduces AES-GCM encryption for storing session private keys in Redis/Valkey. The review feedback highlights several critical issues: a strict session ID check that breaks the warm pool feature, fragile PEM-string hashing for encryption keys that should use raw DER bytes instead, Redis/Valkey cluster compatibility issues due to transactional multi-key operations (CROSSSLOT errors), and a potential goroutine leak in PicoD's background cleanup task due to uncanceled contexts.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements a two-stage secure initialization trust chain between the Router, Workload Manager, and PicoD to mitigate cross-sandbox token replay vulnerabilities. Instead of a single static RSA key pair, it introduces a bootstrap RSA key pair used only for an /init handshake, which dynamically establishes a unique ECDSA (P-256) session key pair for each sandbox instance. Subsequent requests are signed using ES256 JWTs. The changes span PicoD, Workload Manager, Router, and store packages, adding symmetric encryption (AES-GCM) for storing session private keys in Redis/Valkey. The reviewer identified two key issues: first, treating 409 Conflict as a successful initialization during /init retries is dangerous because a new key pair is generated on each retry, leading to a mismatch between the stored private key and PicoD's registered public key; second, the initialization context timeout is hardcoded to 5 seconds, ignoring the configured PicoInitTimeout value.
|
cc @hzxuzhonghu |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (3)
pkg/workloadmanager/sandbox_helper.go:1
- The comment at lines 259–261 contradicts the logic above: 409 is explicitly treated as an error, not as success. Update the comment to match the actual behavior (e.g., “ONLY after confirmed 200 OK”) to avoid future incorrect changes that could accidentally persist a mismatched private key.
pkg/workloadmanager/sandbox_helper.go:1 - When
httpClient.Doreturns a non-nilrespalongside anerr(which can happen in some HTTP failure cases), the current code does not closeresp.Body, potentially leaking connections and degrading performance under retries. Consider closing/discardingresp.Bodyon the error path whenresp != nilbefore the next retry.
pkg/workloadmanager/sandbox_helper_test.go:1 - This file previously contained broader table-driven tests for placeholder/info building logic, but the diff removes them and replaces them with PicoD init tests only. Since
buildSandboxInfoand placeholder TTL/expiry behavior are still critical to API correctness, consider reintroducing equivalent coverage (in this file or a new one) to avoid regressions in expiry/entrypoint formatting or lifecycle handling.
63249f8 to
e67c8b2
Compare
e67c8b2 to
1b51dc4
Compare
1b51dc4 to
ff87f32
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
pkg/workloadmanager/sandbox_helper.go:1
- The comment at lines 259–261 says “200 OK or 409”, but the preceding block explicitly treats 409 as a terminal error (non-200 => error). Please update the comment to match the implemented behavior (or, if 409 should be handled as success, adjust the code accordingly). This inconsistency is likely to confuse future maintainers during incident/debugging.
ff87f32 to
40c1e47
Compare
40c1e47 to
496bb8f
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 39 out of 39 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (2)
pkg/workloadmanager/sandbox_helper.go:1
- When
Do(req)returns a non-nilrespalongside anerr, the response body can remain open across retries, leaking connections. Closeresp.Body(whenresp != nil) before continuing to the next retry attempt.
pkg/workloadmanager/sandbox_helper.go:1 - The comment at line 259 contradicts the implementation and the preceding explanation:
409 Conflictis explicitly not treated as success, but the comment says keys are assigned after200 OK or 409. Update the comment to reflect actual behavior (assignment only after200 OK) to avoid misleading future changes.
496bb8f to
bdbebc8
Compare
bdbebc8 to
e620982
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 40 out of 40 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (3)
pkg/workloadmanager/sandbox_helper.go:1
- The jitter uses the global
math/randsource but this file doesn’t seed it, so across processes the jitter sequence is deterministic (default seed = 1), which undermines the “avoid thundering herd” goal. Use a per-Serverrand.Randseeded from time (orcrypto/rand-derived seed), or switch to a standard backoff/jitter helper (e.g.,wait.JitterUntil/wait.Backoff) to ensure real divergence between replicas.
pkg/workloadmanager/sandbox_helper.go:1 - The comment on line 259 says the key is assigned after “200 OK or 409”, but the implementation explicitly treats 409 as a terminal error (and never assigns the key). Update the comment to match the actual behavior (200 OK only), or adjust the logic if 409 is intended to be handled as success.
pkg/workloadmanager/sandbox_helper_test.go:1 - This file previously covered
buildSandboxPlaceHolder(...)andbuildSandboxInfo(...)with table-driven tests, but those tests were removed in this change. Since those functions still exist and are pivotal for persisted sandbox metadata, consider restoring equivalent coverage (either reintroduce the table-driven cases here or add focused tests elsewhere) to avoid regressions in expiry/status/entrypoint formatting behavior.
… bootstrap keys Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
e620982 to
80d4075
Compare
feat(picod): implement two-stage secure initialization to isolate sandbox sessions
What type of PR is this?
/kind feature
/kind security
What this PR does / why we need it
This PR implements Two-Stage Secure Initialization for PicoD to resolve a critical security vulnerability regarding cross-sandbox token replays, while incorporating defense-in-depth, encrypted key storage, and performance optimizations.
Previously, under the Plain Authentication design, all PicoD sandboxes verified JWTs using the same shared public key injected via the
PICOD_AUTH_PUBLIC_KEYenvironment variable. This meant a valid JWT issued for one sandbox could theoretically be replayed against another sandbox, breaking tenant isolation.This PR introduces a cryptographically isolated two-stage handshake:
Bootstrap Stage: The Workload Manager generates a static RSA-2048 Bootstrap Key Pair on startup, stores it in the
agentcube-bootstrap-identityKubernetes Secret/ConfigMap (renamed frompicod-router-identity/picod-router-public-key), and injectsPICOD_BOOTSTRAP_PUBLIC_KEYalong with a uniquePICOD_SESSION_IDinto each PicoD container environment. PicoD loads the bootstrap key at startup and waits in an uninitialized state.Session Stage (
POST /init): The Workload Manager generates a per-sandbox ECDSA (P-256) Session Key Pair. It calls the newPOST /initendpoint on PicoD with a short-lived bootstrap JWT (RS256, signed by the Bootstrap Private Key, scoped bysub=PICOD_SESSION_ID) containing thesession_public_keyclaim. On success, PicoD stores the ECDSA public key and transitions to initialized. All subsequent user requests must be signed with the ECDSA Session Private Key (ES256).Files Changed
cmd/picod/main.gocontext.Background()toNewServerto enable lifecycle-aware goroutinescmd/workload-manager/main.goserver.GetBootstrapPublicKeyPEMintoCodeInterpreterReconcilerafter server constructionpkg/picod/auth.goAuthManager: split intobootstrapPublicKey(RSA) +sessionPublicKey(ECDSA); addedVerifyBootstrapJWT,SetSessionPublicKey, JTI replay-protection viasync.Map, background JTI eviction goroutinepkg/picod/server.goPOST /initroute with per-endpoint rate limiter (2 req/s, burst 5);InitHandlerverifies bootstrap JWT and sets session key; API routes return503 DAEMON_NOT_INITIALIZEDuntil session is establishedpkg/router/jwt.goagentcube-bootstrap-identity; added LRU key cache (keyCacheMaxSize=1000) for per-session ECDSA private keys; addedGenerateTokenWithKeyandGetCachedKeyhelperspkg/router/handlers.gogenerateSandboxJWTfetches per-session ECDSA private key from store (withErrKeyNotCachedfallback); returns503 SESSION_PRIVATE_KEY_NOT_FOUNDif key is missingpkg/router/server.goIdleConnTimeout: 85sandMaxIdleConnsPerHost: 100on HTTP transport; passstoreClienttoTryStoreOrLoadJWTKeySecretpkg/store/interface.goStoreSessionPrivateKey,GetSessionPrivateKey,SetEncryptionKeytoStoreinterfacepkg/store/crypto.go(new)Cryptohelper for at-rest encryption of session private keyspkg/store/keys.go(new)sessionPrivKeyPrefixconstant shared by Redis and Valkey backendspkg/store/store_redis.goStoreSessionPrivateKey/GetSessionPrivateKey(AES-GCM + base64); updatedDeleteSandboxBySessionIDto use non-transactionalPipeline()for cluster-safe deletionpkg/store/store_valkey.goDeleteSandboxBySessionIDuses bareDoMultifor cluster-safe deletionpkg/common/types/sandbox.goSessionPrivateKey(excluded from JSON, persisted separately) andAuthModefields toSandboxInfodocs/design/PicoD-Plain-Authentication-Design.mddocs/design/auth-proposal.mddocs/agentcube/blog/release-v0.1.0/index.mdpkg/picod/auth_test.go,picod_test.go,execute_test.go,server_test.goTestInitHandlercovering success/conflict/invalid-token cases,TestAuthMiddleware_ValidTokenpkg/router/session_manager_test.gopkg/workloadmanager/server.goPicoInitTimeout, bootstrap auth manager wiring,GetBootstrapPublicKeyPEM, and store encryption key setuppkg/workloadmanager/workload_builder.goAuthModedefaults from the new modelKey Security Hardening & Optimizations Included
PICOD_SESSION_ID): PicoD validates thesubclaim in the bootstrap JWT againstPICOD_SESSION_IDwhen set (on-demand pods). Warm pool pods that are pre-created before a session is assigned skip this check; replay protection is still enforced via the JTI guard.AuthManagertracks seen JTIs viasync.Mapwith a background 1-minute cleanup goroutine; bootstrap JTIs older than 3 minutes are evicted.SetSessionPublicKeyreturnsErrAlreadyInitializedon second call;InitHandlerresponds409 Conflictto prevent session key hijacking via re-initialization.pkg/store/crypto.go) before persistence.DeleteSandboxBySessionIDuses a non-transactionalPipeline()(Redis) / bareDoMulti(Valkey) to delete sandbox metadata and the private key safely in clustered deployments./init: A token-bucket rate limiter (2 req/s, burst 5) guards the/initendpoint against spam during the startup window.IdleConnTimeoutset to 85s (aligned with AWS ALB 60s timeout) andMaxIdleConnsPerHost=100to prevent stale-connection race conditions./inithandshake so sandbox initialization does not block indefinitely.SandboxInfoandCreateSandboxResponsenow carryAuthMode, allowing the Router to distinguishpicodsandboxes fromnone-auth sandboxes.PICOD_AUTH_PUBLIC_KEYis still accepted as a fallback; deployments should migrate toPICOD_BOOTSTRAP_PUBLIC_KEY.Which issue(s) this PR fixes
Fixes #
Special notes for your reviewer
This implementation realizes the architecture outlined in
docs/design/agentcube-proposal.mdunder section 5.2 (Picod Workflow). The relevant design documents (docs/design/PicoD-Plain-Authentication-Design.mdandauth-proposal.md) have been fully updated within this PR to accurately document the newPICOD_BOOTSTRAP_PUBLIC_KEY,PICOD_SESSION_ID, and the two-stage ECDSA handshake.The initialization order in
cmd/workload-manager/main.gowas intentionally changed:CodeInterpreterReconcileris now constructed afterNewServerso it can receiveserver.GetBootstrapPublicKeyPEMas a live function reference rather than a nil pointer.Does this PR introduce a user-facing change?