vMCP session restore ignores live request identity, breaks token-exchange backends across pod restarts #5387

@jerm-dro

Bug description

When VirtualMCPServer.spec.sessionStorage is enabled and a gateway pod restart causes a session-cache miss, factory.RestoreSession reconstructs identity from storedMetadata[MetadataKeyIdentitySubject] only, leaving identity.Token empty. Token-exchange backends then short-circuit on "identity has no token", every backend init fails, and the restored session is returned with zero tools (count=0). The bearer that the OIDC middleware just validated on the triggering request is sitting on the request context, unused.

Root cause

MultiSessionGetter.GetMultiSession(sessionID string) (pkg/vmcp/server/sessionmanager/session_manager.go:633) is the entry point for subsequent requests. Its signature takes only the session ID — no context.Context, no *http.Request, no auth.Identity. There's an in-file TODO at line 631 acknowledging the context-propagation gap.

On a cache miss, loadSession → factory.RestoreSession runs with no access to the live request. RestoreSession (pkg/vmcp/session/factory.go:537, identity reconstruction at 563–567) therefore reconstructs from stored metadata only:

var identity *auth.Identity
if subject := storedMetadata[MetadataKeyIdentitySubject]; subject != "" {
    identity = &auth.Identity{}
    identity.Subject = subject
}

The comment at line 560 makes the storage intent explicit: "The original bearer token is never persisted… so Token is empty." That part is fine. The gap is that the restore path has no fallback to the live identity on the triggering request.

The tokenless identity flows into makeBaseSession → backend connector → pkg/vmcp/auth/strategies/tokenexchange.go, where if identity.Token == "" { return "identity has no token" } fails every backend init.
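One possible shape for the fallback, as a self-contained sketch rather than the actual fix: prefer the live identity when the restore path can see one, and fall back to the stored subject otherwise. The names restoreIdentity and exchangeGuard, the key's literal value, and the subject-mismatch check are all assumptions made for illustration; none of them come from the codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// Identity stands in for auth.Identity; only the fields named in this issue.
type Identity struct {
	Subject string
	Token   string
}

// metadataKeyIdentitySubject stands in for MetadataKeyIdentitySubject.
// The literal value is an assumption.
const metadataKeyIdentitySubject = "identity.subject"

var errSubjectMismatch = errors.New("live identity subject does not match stored session subject")

// restoreIdentity (hypothetical) prefers the live request identity and falls
// back to the stored subject, which reproduces today's tokenless behaviour.
func restoreIdentity(stored map[string]string, live *Identity) (*Identity, error) {
	subject := stored[metadataKeyIdentitySubject]
	if live != nil {
		// Assumed safety check: refuse to bind a session to a different principal.
		if subject != "" && live.Subject != subject {
			return nil, errSubjectMismatch
		}
		return &Identity{Subject: live.Subject, Token: live.Token}, nil
	}
	if subject == "" {
		return nil, nil // anonymous session: no identity to restore
	}
	return &Identity{Subject: subject}, nil
}

// exchangeGuard mimics the token-exchange short-circuit quoted above.
func exchangeGuard(id *Identity) error {
	if id == nil || id.Token == "" {
		return errors.New("identity has no token")
	}
	return nil
}

func main() {
	stored := map[string]string{metadataKeyIdentitySubject: "alice"}

	storedOnly, _ := restoreIdentity(stored, nil)
	fmt.Println("stored metadata only:", exchangeGuard(storedOnly))

	live := &Identity{Subject: "alice", Token: "bearer-from-request"}
	withLive, _ := restoreIdentity(stored, live)
	fmt.Println("with live identity:", exchangeGuard(withLive))
}
```

The nil-live branch keeps anonymous and non-OIDC restores on their current path, so only sessions with a live bearer change behaviour.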

Steps to reproduce

  1. Deploy a VirtualMCPServer with incomingAuth.type: oidc, at least one backend MCPServer with externalAuthConfigRef of type tokenExchange, and spec.sessionStorage pointing at Redis.
  2. Connect a Streamable HTTP client (e.g. Claude Code) and successfully list/call tools.
  3. Cycle the gateway pod (kubectl rollout restart, operator reconcile, image bump).
  4. With the same client still connected, send another tool request.

Expected behavior

The next request restores the session from Redis and uses the live bearer on the request context to mint per-backend exchanged tokens. Tool calls succeed without the client re-authenticating.

Actual behavior

The same session_id is restored from Redis, but audit logs show subjects.user: "anonymous". Pre-call logs show "identity has no token" for every backend, then "All backends failed to initialise; session will have no capabilities", then "prefix strategy created unique tools count=0". The next tool call returns "tool not found". A manual /mcp re-auth fixes it.

Multi-replica symptom (same root cause)

In multi-replica deployments the same bug manifests as cross-pod cache eviction: pod B's failed RestoreSession returns a session with an empty backend list, loadSession writes that empty-list metadata back to Redis (session_manager.go:738), pod A's checkSession sees MetadataKeyBackendIDs drift and evicts its working session, and the next request to pod A returns "session is closed". The eviction mechanism itself (drift-based propagation of legitimate backend membership changes) is intentional and should stay — what's wrong is that pod B's restore failed in the first place. Fixing the identity restore should resolve this symptom too.
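The sequence above can be modelled in a few lines. This is a toy simulation, not the real session manager: the shared store is a plain map standing in for Redis, the pod type and its checkSession method are simplified stand-ins, and the literal value of the backend-IDs key is assumed.

```go
package main

import "fmt"

// metadataKeyBackendIDs stands in for MetadataKeyBackendIDs; the literal is assumed.
const metadataKeyBackendIDs = "backend.ids"

// sharedStore models session metadata in Redis: sessionID -> metadata.
type sharedStore map[string]map[string]string

// pod models one gateway replica with a local view of each session's backends.
type pod struct {
	name   string
	cached map[string]string // sessionID -> backend IDs seen when the session was built
}

// checkSession evicts the local session when the shared backend list has drifted.
// It returns true if the local session is still usable.
func (p *pod) checkSession(store sharedStore, sessionID string) bool {
	cached, ok := p.cached[sessionID]
	if !ok {
		return false
	}
	if store[sessionID][metadataKeyBackendIDs] != cached {
		delete(p.cached, sessionID) // drift detected: evict
		return false
	}
	return true
}

func main() {
	store := sharedStore{"s1": {metadataKeyBackendIDs: "github,jira"}}
	podA := &pod{name: "A", cached: map[string]string{"s1": "github,jira"}}

	fmt.Println("pod A before pod B restore:", podA.checkSession(store, "s1"))

	// Pod B's restore fails every backend init and writes the empty list back.
	store["s1"][metadataKeyBackendIDs] = ""

	fmt.Println("pod A after pod B restore:", podA.checkSession(store, "s1"))
}
```

In this model the drift check behaves as designed; the bad input is the empty list written after pod B's failed restore.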

Out of scope

  • Persisting exchanged backend tokens or refresh tokens. The per-backend tokens minted at restore still have their own expirations; session-scoped token bookkeeping with refresh is a separate piece of work and not required for this fix.

Acceptance criteria

  • After a gateway pod restart, an already-connected Streamable HTTP client with OIDC + tokenExchange backends gets a successful tool call on the next request, without manual re-auth.
  • In a 2-replica deployment with sticky-cookie ingress, a sibling pod handling the same session does not cause the working pod to evict and close the session.
  • Anonymous and non-OIDC flows continue to work.
  • Regression test exercising restart with a tokenExchange backend and a live bearer on the post-restart request.

Environment

  • ToolHive operator v0.27.0
  • Redis (Bitnami 25.4.1) for sessionStorage
  • Streamable HTTP client (Claude Code), OIDC (Keycloak), per-backend tokenExchange

Additional context

Originally reported by Gaston in the Discord forum post with full logs and timeline.

Labels

  • authentication
  • bug: Something isn't working
  • vmcp: Virtual MCP Server related issues