3 changes: 2 additions & 1 deletion deploy/charts/operator/README.md
@@ -46,7 +46,7 @@ The command removes all the Kubernetes components associated with the chart and
|-----|------|---------|-------------|
| fullnameOverride | string | `"toolhive-operator"` | Provide a fully-qualified name override for resources |
| nameOverride | string | `""` | Override the name of the chart |
| operator | object | `{"affinity":{},"autoscaling":{"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80},"containerSecurityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"readOnlyRootFilesystem":true,"runAsNonRoot":true,"runAsUser":1000,"seccompProfile":{"type":"RuntimeDefault"}},"defaultImagePullSecrets":[],"env":[],"features":{"experimental":false},"gc":{"gogc":75,"gomemlimit":"110MiB"},"image":"ghcr.io/stacklok/toolhive/operator:v0.28.3","imagePullPolicy":"IfNotPresent","imagePullSecrets":[],"leaderElectionRole":{"binding":{"name":"toolhive-operator-leader-election-rolebinding"},"name":"toolhive-operator-leader-election-role","rules":[{"apiGroups":[""],"resources":["configmaps"],"verbs":["get","list","watch","create","update","patch","delete"]},{"apiGroups":["coordination.k8s.io"],"resources":["leases"],"verbs":["get","list","watch","create","update","patch","delete"]},{"apiGroups":["events.k8s.io"],"resources":["events"],"verbs":["create","patch"]}]},"livenessProbe":{"httpGet":{"path":"/healthz","port":"health"},"initialDelaySeconds":15,"periodSeconds":20},"nodeSelector":{},"podAnnotations":{},"podLabels":{},"podSecurityContext":{"runAsNonRoot":true},"ports":[{"containerPort":8080,"name":"metrics","protocol":"TCP"},{"containerPort":8081,"name":"health","protocol":"TCP"}],"proxyHost":"0.0.0.0","rbac":{"allowedNamespaces":[],"scope":"cluster"},"readinessProbe":{"httpGet":{"path":"/readyz","port":"health"},"initialDelaySeconds":5,"periodSeconds":10},"replicaCount":1,"resources":{"limits":{"cpu":"500m","memory":"128Mi"},"requests":{"cpu":"10m","memory":"64Mi"}},"serviceAccount":{"annotations":{},"automountServiceAccountToken":true,"create":true,"labels":{},"name":"toolhive-operator"},"tolerations":[],"toolhiveRunnerImage":"ghcr.io/stacklok/toolhive/proxyrunner:v0.28.3","vmcpImage":"ghcr.io/stacklok/toolhive/vmcp:v0.28.3","volumeMounts":[],"volumes":[]}` | All values for the operator deployment and 
associated resources |
| operator | object | `{"affinity":{},"autoscaling":{"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80},"containerSecurityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"readOnlyRootFilesystem":true,"runAsNonRoot":true,"runAsUser":1000,"seccompProfile":{"type":"RuntimeDefault"}},"defaultImagePullSecrets":[],"env":[],"features":{"experimental":false,"storageVersionMigrator":true},"gc":{"gogc":75,"gomemlimit":"110MiB"},"image":"ghcr.io/stacklok/toolhive/operator:v0.28.3","imagePullPolicy":"IfNotPresent","imagePullSecrets":[],"leaderElectionRole":{"binding":{"name":"toolhive-operator-leader-election-rolebinding"},"name":"toolhive-operator-leader-election-role","rules":[{"apiGroups":[""],"resources":["configmaps"],"verbs":["get","list","watch","create","update","patch","delete"]},{"apiGroups":["coordination.k8s.io"],"resources":["leases"],"verbs":["get","list","watch","create","update","patch","delete"]},{"apiGroups":["events.k8s.io"],"resources":["events"],"verbs":["create","patch"]}]},"livenessProbe":{"httpGet":{"path":"/healthz","port":"health"},"initialDelaySeconds":15,"periodSeconds":20},"nodeSelector":{},"podAnnotations":{},"podLabels":{},"podSecurityContext":{"runAsNonRoot":true},"ports":[{"containerPort":8080,"name":"metrics","protocol":"TCP"},{"containerPort":8081,"name":"health","protocol":"TCP"}],"proxyHost":"0.0.0.0","rbac":{"allowedNamespaces":[],"scope":"cluster"},"readinessProbe":{"httpGet":{"path":"/readyz","port":"health"},"initialDelaySeconds":5,"periodSeconds":10},"replicaCount":1,"resources":{"limits":{"cpu":"500m","memory":"128Mi"},"requests":{"cpu":"10m","memory":"64Mi"}},"serviceAccount":{"annotations":{},"automountServiceAccountToken":true,"create":true,"labels":{},"name":"toolhive-operator"},"tolerations":[],"toolhiveRunnerImage":"ghcr.io/stacklok/toolhive/proxyrunner:v0.28.3","vmcpImage":"ghcr.io/stacklok/toolhive/vmcp:v0.28.3","volumeMounts":[],"volumes":[]}` | All values for the 
operator deployment and associated resources |
| operator.affinity | object | `{}` | Affinity settings for the operator pod |
| operator.autoscaling | object | `{"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}` | Configuration for horizontal pod autoscaling |
| operator.autoscaling.enabled | bool | `false` | Enable autoscaling for the operator |
@@ -57,6 +57,7 @@ The command removes all the Kubernetes components associated with the chart and
| operator.defaultImagePullSecrets | list | `[]` | List of image pull secrets that the operator applies as defaults to every workload it spawns (proxy runners, vMCP servers, registry API, etc.). Per-CR `imagePullSecrets` take precedence on name collisions; chart-level entries are appended additively. The operator parses these once at startup from the TOOLHIVE_DEFAULT_IMAGE_PULL_SECRETS environment variable. The Secrets must exist in the namespace where each workload is created. Each entry may be either a plain string (the Secret name) or an object with a `name` field, e.g.: defaultImagePullSecrets: - regcred - name: otherscred The two shapes are equivalent; the object form matches `operator.imagePullSecrets` above for convenience. |
| operator.env | list | `[]` | Environment variables to set in the operator container. Supported toolhive-specific variables include: - TOOLHIVE_SKIP_UPDATE_CHECK: set to "true" to disable the operator's periodic update check against the ToolHive update API. Also disables the usage-metrics collection that is gated on the same check. |
| operator.features.experimental | bool | `false` | Enable experimental features |
| operator.features.storageVersionMigrator | bool | `true` | Enable the StorageVersionMigrator controller, which auto-cleans status.storedVersions on opted-in toolhive.stacklok.dev CRDs so a future release can drop deprecated versions (e.g. v1alpha1) without orphaning etcd objects. Leave this on unless you are running kube-storage-version-migrator externally. This automatically sets the ENABLE_STORAGE_VERSION_MIGRATOR environment variable. |
| operator.gc | object | `{"gogc":75,"gomemlimit":"110MiB"}` | Go memory limits and garbage collection percentage for the operator container |
| operator.gc.gogc | int | `75` | Go garbage collection percentage for the operator container |
| operator.gc.gomemlimit | string | `"110MiB"` | Go memory limits for the operator container |
2 changes: 2 additions & 0 deletions deploy/charts/operator/templates/deployment.yaml
@@ -68,6 +68,8 @@ spec:
  value: "true"
- name: ENABLE_EXPERIMENTAL_FEATURES
  value: {{ .Values.operator.features.experimental | quote }}
- name: ENABLE_STORAGE_VERSION_MIGRATOR
  value: {{ .Values.operator.features.storageVersionMigrator | quote }}
{{- if eq .Values.operator.rbac.scope "namespace" }}
- name: WATCH_NAMESPACE
  value: "{{ .Values.operator.rbac.allowedNamespaces | join "," }}"
7 changes: 7 additions & 0 deletions deploy/charts/operator/values.yaml
@@ -8,6 +8,13 @@ operator:
  features:
    # -- Enable experimental features
    experimental: false
    # -- Enable the StorageVersionMigrator controller, which auto-cleans
    # status.storedVersions on opted-in toolhive.stacklok.dev CRDs so a
    # future release can drop deprecated versions (e.g. v1alpha1) without
    # orphaning etcd objects. Leave this on unless you are running
    # kube-storage-version-migrator externally.
    # This automatically sets the ENABLE_STORAGE_VERSION_MIGRATOR environment variable.
    storageVersionMigrator: true
  # -- Number of replicas for the operator deployment
  replicaCount: 1

114 changes: 114 additions & 0 deletions docs/operator/storage-version-migration.md
@@ -0,0 +1,114 @@
# Storage Version Migration

The ToolHive operator ships a `StorageVersionMigrator` controller that keeps the `status.storedVersions` list of every opted-in ToolHive CRD clean, so a future operator release can drop deprecated API versions (e.g. `v1alpha1`) without orphaning objects in etcd.

## Why this exists

When a CRD graduates from, say, `v1alpha1` to `v1beta1` with both versions served and `v1beta1` as the storage version, existing objects continue to work because they are transparently converted on read and write. However, the API server records every version that has ever been used for storage in `CustomResourceDefinition.status.storedVersions`. Until that list is trimmed, the Kubernetes API server refuses to let you remove a listed version from `spec.versions`, because doing so would orphan any etcd-stored objects encoded at that version.

The cleanup is not automatic. Someone has to re-store every existing object at the current storage version, then explicitly patch `status.storedVersions` to drop the old entry. The `StorageVersionMigrator` controller does this for you, on every opted-in ToolHive CRD, continuously. See [upstream Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/) for the mechanism.
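The rule described above can be expressed as a small check. This is a minimal sketch of the constraint only, not the API server's validation code; `canDropVersion` is a hypothetical helper name.

```go
package main

import "fmt"

// canDropVersion reports whether a version could be removed from
// spec.versions: removal is rejected while the version is still
// listed in status.storedVersions.
func canDropVersion(storedVersions []string, version string) bool {
	for _, v := range storedVersions {
		if v == version {
			return false
		}
	}
	return true
}

func main() {
	// Before migration: v1alpha1 is still recorded as a stored version.
	fmt.Println(canDropVersion([]string{"v1alpha1", "v1beta1"}, "v1alpha1")) // false
	// After migration: only the current storage version remains.
	fmt.Println(canDropVersion([]string{"v1beta1"}, "v1alpha1")) // true
}
```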

## What the controller does

For each opted-in CRD:

1. Reads `spec.versions` to find the entry with `storage: true`.
2. Stops early if `status.storedVersions` already equals `[<currentStorageVersion>]` and only one version is served, because there is nothing to do.
3. Otherwise, lists every Custom Resource of that kind and issues a metadata-only Server-Side Apply (SSA) against the `/status` subresource with field manager `thv-storage-version-migrator`. This forces the API server to re-encode each object at the current storage version without triggering admission webhooks: SSA on `/status` typically bypasses webhooks registered on the main resource, and the empty apply owns no fields, so it does not fight other controllers.
4. Once every object has been re-stored, patches `CRD.status.storedVersions` to `[<currentStorageVersion>]` using an optimistic-lock merge patch, so a concurrent API-server write causes a clean retry rather than a silent overwrite.

CRDs without a `/status` subresource fall back to main-resource SSA.
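Steps 1 and 2, together with the subresource fallback, can be sketched with plain Go types. The `crd` and `crdVersion` structs below are simplified, hypothetical stand-ins for the `apiextensions.k8s.io/v1` types, and the function names are illustrative rather than the controller's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// crdVersion and crd model only the fields the steps above rely on.
// They are hypothetical stand-ins, not the real apiextensions types.
type crdVersion struct {
	Name      string
	Served    bool
	Storage   bool
	HasStatus bool // true when the version declares a /status subresource
}

type crd struct {
	Versions       []crdVersion
	StoredVersions []string
}

// storageVersion returns the entry marked storage: true (step 1).
func storageVersion(c crd) (crdVersion, error) {
	for _, v := range c.Versions {
		if v.Storage {
			return v, nil
		}
	}
	return crdVersion{}, errors.New("no version has storage: true")
}

// needsMigration is the step 2 short-circuit: no work is needed only when
// storedVersions is exactly [storage] and a single version is served.
func needsMigration(c crd, storage string) bool {
	served := 0
	for _, v := range c.Versions {
		if v.Served {
			served++
		}
	}
	clean := len(c.StoredVersions) == 1 && c.StoredVersions[0] == storage
	return !(clean && served == 1)
}

// applyTarget picks where the no-op apply is sent: the /status
// subresource when it exists, otherwise the main resource.
func applyTarget(v crdVersion) string {
	if v.HasStatus {
		return "/status"
	}
	return "main resource"
}

func main() {
	c := crd{
		Versions: []crdVersion{
			{Name: "v1alpha1", Served: true},
			{Name: "v1beta1", Served: true, Storage: true, HasStatus: true},
		},
		StoredVersions: []string{"v1alpha1", "v1beta1"},
	}
	sv, err := storageVersion(c)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sv.Name, needsMigration(c, sv.Name), applyTarget(sv))
}
```

Keeping the decision separate from the API calls makes the short-circuit easy to unit-test without a cluster.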

## The opt-in label

A CRD participates in migration only if it carries:

```yaml
metadata:
  labels:
    toolhive.stacklok.dev/auto-migrate-storage-version: "true"
```
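A label check of this kind might look like the following. This is a sketch under the assumption that the controller matches the literal value `"true"`; `isOptedIn` and `autoMigrateLabel` are hypothetical names.

```go
package main

import "fmt"

// autoMigrateLabel is the opt-in label key documented above.
const autoMigrateLabel = "toolhive.stacklok.dev/auto-migrate-storage-version"

// isOptedIn reports whether a CRD's labels opt it in to migration.
// Assumption: only the exact value "true" counts as opted in.
func isOptedIn(labels map[string]string) bool {
	return labels[autoMigrateLabel] == "true"
}

func main() {
	fmt.Println(isOptedIn(map[string]string{autoMigrateLabel: "true"})) // true
	fmt.Println(isOptedIn(map[string]string{}))                         // false
}
```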

The label is set at CRD-generation time via a kubebuilder marker on each Go type in `cmd/thv-operator/api/v1beta1/`:

```go
// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
type MCPServer struct { ... }
```

`task operator-manifests` bakes the label into the generated CRD YAML. All current ToolHive root types ship with the marker. A CI test (`TestV1beta1TypesMarkerCoverage`) fails the build if a root type is added without either this marker or an explicit `// +thv:storage-version-migrator:exclude` sibling marker, so the migrator cannot silently miss a new CRD.
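A coverage check of this kind can be built on the standard `go/parser` package. The sketch below is not the real `TestV1beta1TypesMarkerCoverage`. It assumes that a "root type" is one annotated with `+kubebuilder:object:root=true`, and it does not special-case `...List` types, which a real test would probably need to skip.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

const (
	rootMarker    = "+kubebuilder:object:root=true" // assumed definition of "root type"
	migrateMarker = "+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true"
	excludeMarker = "+thv:storage-version-migrator:exclude"
)

// sample is illustrative input; the type names are made up.
const sample = `package v1beta1

// +kubebuilder:object:root=true
// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true

// Covered is migrated.
type Covered struct{}

// +kubebuilder:object:root=true
// +thv:storage-version-migrator:exclude
type Excluded struct{}

// +kubebuilder:object:root=true
type Forgotten struct{}

// Helper is not a root type.
type Helper struct{}
`

// markersBetween collects every line comment located between two positions,
// so markers separated from the doc comment by a blank line are still seen.
func markersBetween(file *ast.File, lo, hi token.Pos) map[string]bool {
	found := map[string]bool{}
	for _, group := range file.Comments {
		if group.Pos() < lo || group.End() > hi {
			continue
		}
		for _, c := range group.List {
			found[strings.TrimSpace(strings.TrimPrefix(c.Text, "//"))] = true
		}
	}
	return found
}

// uncoveredRootTypes returns root types that carry neither the migrate
// marker nor the exclude marker.
func uncoveredRootTypes(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "types.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var missing []string
	prevEnd := file.Package
	for _, decl := range file.Decls {
		markers := markersBetween(file, prevEnd, decl.Pos())
		prevEnd = decl.End()
		gen, ok := decl.(*ast.GenDecl)
		if !ok || gen.Tok != token.TYPE || !markers[rootMarker] {
			continue
		}
		if markers[migrateMarker] || markers[excludeMarker] {
			continue
		}
		for _, spec := range gen.Specs {
			if ts, ok := spec.(*ast.TypeSpec); ok {
				missing = append(missing, ts.Name.Name)
			}
		}
	}
	return missing, nil
}

func main() {
	missing, err := uncoveredRootTypes(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(missing) // [Forgotten]
}
```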

Adding a new CRD that should be migrated:

```go
// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
type NewShinyThing struct { ... }
```

Adding a new CRD that deliberately should NOT be migrated (e.g. an experimental kind that is still stabilising its schema):

```go
// +thv:storage-version-migrator:exclude
type ExperimentalThing struct { ... }
```

## Disabling the controller

Set the Helm feature flag:

```yaml
operator:
  features:
    storageVersionMigrator: false # default: true
```

This sets `ENABLE_STORAGE_VERSION_MIGRATOR=false` on the operator Deployment, and the reconciler is not registered with the manager.
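How the operator parses this variable is not shown here. The following sketch shows one plausible way to gate controller registration on it, assuming that an unset or unparsable value falls back to the chart default of `true`; `migratorEnabled` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// migratorEnabled interprets the raw ENABLE_STORAGE_VERSION_MIGRATOR value.
// Assumption: empty or unparsable input falls back to the chart default (true).
func migratorEnabled(raw string) bool {
	if raw == "" {
		return true
	}
	enabled, err := strconv.ParseBool(raw)
	if err != nil {
		return true
	}
	return enabled
}

func main() {
	if migratorEnabled(os.Getenv("ENABLE_STORAGE_VERSION_MIGRATOR")) {
		fmt.Println("registering StorageVersionMigrator reconciler")
	} else {
		fmt.Println("StorageVersionMigrator disabled; reconciler not registered")
	}
}
```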

Disable only if you are running an external migrator such as [kube-storage-version-migrator](https://github.com/kubernetes-sigs/kube-storage-version-migrator). Disabling without a replacement is a footgun: the next ToolHive release that removes a deprecated API version will refuse to apply its CRD update until `storedVersions` is cleaned, and you will have to clean it yourself.

## Per-CRD emergency escape hatch

Removing the label on a live cluster excludes that single CRD from migration immediately:

```bash
kubectl label crd/mcpservers.toolhive.stacklok.dev \
  toolhive.stacklok.dev/auto-migrate-storage-version-
```

This is intended for incident response only. If you deploy the operator via GitOps (Argo CD, Flux) or `helm upgrade`, the next sync or upgrade re-applies the chart-set label, which under GitOps can happen within seconds. Use the `storageVersionMigrator` feature flag for long-term opt-out.

## Interaction with version removal releases

The `StorageVersionMigrator` must have had time to run against your cluster *before* an operator release that drops a deprecated CRD version ships. The typical sequence is:

1. **Release N**: both versions served, newer version is storage, `StorageVersionMigrator` enabled. The controller quietly re-stores all objects and trims `storedVersions` on every cluster during this deprecation window.
2. **Release N+1 or later**: the deprecated version is removed from `spec.versions`. Because every cluster's `storedVersions` was already cleaned during release N, the CRD update applies cleanly.

If your cluster was upgraded directly from a pre-migrator release to the version-removal release without ever running release N, you must clean `storedVersions` manually (or deploy `kube-storage-version-migrator` once) before the upgrade can succeed.
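A manual cleanup ends with the same kind of status patch the controller issues. The sketch below only builds a JSON merge-patch body. It assumes that including `metadata.resourceVersion` is what provides the optimistic lock, and the type and function names are hypothetical. The patch is only safe after every object has been re-stored at the current storage version.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// crdStatusPatch is a hypothetical shape for a JSON merge patch that
// replaces status.storedVersions. Carrying resourceVersion makes the API
// server reject the patch with a conflict if the CRD changed in between.
type crdStatusPatch struct {
	Metadata patchMetadata `json:"metadata"`
	Status   patchStatus   `json:"status"`
}

type patchMetadata struct {
	ResourceVersion string `json:"resourceVersion"`
}

type patchStatus struct {
	StoredVersions []string `json:"storedVersions"`
}

// storedVersionsPatch builds the patch body that trims storedVersions
// down to the current storage version.
func storedVersionsPatch(resourceVersion, storageVersion string) ([]byte, error) {
	return json.Marshal(crdStatusPatch{
		Metadata: patchMetadata{ResourceVersion: resourceVersion},
		Status:   patchStatus{StoredVersions: []string{storageVersion}},
	})
}

func main() {
	// "12345" is a placeholder resourceVersion read from the live CRD.
	body, err := storedVersionsPatch("12345", "v1beta1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(body))
}
```

A body like this would be sent to the CRD's `/status` subresource as a merge patch, for example with `kubectl patch --subresource=status --type=merge`.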

## Verification

For any ToolHive CRD in a cluster where the controller has run:

```bash
kubectl get crd mcpservers.toolhive.stacklok.dev \
  -o jsonpath='{.status.storedVersions}'
# ["v1beta1"]
```

If the list contains more than one entry, the controller has not yet finished migrating. Check the operator logs for reconcile errors and look for the `StorageVersionMigrationFailed` event on the CRD.
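For scripted checks, the jsonpath output can be parsed as a JSON array. This is a minimal sketch, assuming the output has the form shown above; `migrationComplete` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// migrationComplete parses the output of the kubectl jsonpath query above
// and reports whether the only stored version is the expected one.
func migrationComplete(jsonpathOutput, want string) (bool, error) {
	var stored []string
	if err := json.Unmarshal([]byte(jsonpathOutput), &stored); err != nil {
		return false, err
	}
	return len(stored) == 1 && stored[0] == want, nil
}

func main() {
	done, err := migrationComplete(`["v1alpha1","v1beta1"]`, "v1beta1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(done) // false: v1alpha1 is still listed
}
```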

## RBAC

The controller requires the following permissions, which are generated from kubebuilder markers and applied by the operator Helm chart:

- `customresourcedefinitions.apiextensions.k8s.io`: `get`, `list`, `watch`
- `customresourcedefinitions/status.apiextensions.k8s.io`: `update`, `patch`
- `*.toolhive.stacklok.dev`: `get`, `list`, `patch`
- `*/status.toolhive.stacklok.dev`: `patch`

## Related

- Issue: [stacklok/toolhive#4969](https://github.com/stacklok/toolhive/issues/4969)
- Kubernetes CRD versioning: [official docs](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/)
- Reference implementation: [kubernetes-sigs/cluster-api `crdmigrator`](https://github.com/kubernetes-sigs/cluster-api/tree/main/controllers/crdmigrator)