Add storage-version migration docs + upgrade-guide walkthrough #5451

Open
ChrisJBurns wants to merge 1 commit into main from chris/svm-docs

Conversation

@ChrisJBurns
Collaborator

Summary

User-facing documentation for the StorageVersionMigrator controller, now that the controller (PR #5362), opt-in labels + CI guard (PR #5391), and chart-side feature flag (PR #5418) are all on main.

Two files:

  • docs/operator/storage-version-migration.md — reference doc explaining the mechanism, label contract, opt-in flag, admission-policy compatibility, and the skip-a-version upgrade trap.
  • docs/operator/upgrade-guide/ — kind-cluster reproducible end-to-end walkthrough of the v1alpha1→v1beta1 graduation, plus CR fixtures for all 12 graduated CRDs.

Supersedes #5011, which carried an earlier draft of the same docs against the pre-review design (SSA on /status, exclude marker, default-on). The content here matches what actually shipped on main.

Part of #4969.

Medium level

Reference doc highlights:

  • Mechanism: plain Get+Update on the main resource, per-CR conflict retry up to 3 attempts, RequeueAfter: 30s on the sentinel path (not exponential backoff).
  • Concurrent-write safety: documents the upstream kube-storage-version-migrator semantics — a Conflict on Update means another writer succeeded, which already re-encoded the CR at the storage version.
  • Admission-policy compatibility section: webhooks fire on every Update before the apiserver's bytes-equality elision check. Policies that reject same-spec round-trip Updates (Kyverno, Gatekeeper, OPA) would prevent the migrator from converging. Check for such policies before enabling the migrator on a cluster with strict admission control.
  • Label contract: every storage-version root type must carry +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true. CI test TestStorageVersionRootMarkerCoverage enforces this; no self-serve exclude marker exists.
  • Skip-a-version upgrade trap: clusters that upgrade past the migrator-enabled release into a future version-removal release will hit an apiserver rejection on the CRD update. Recovery path documented (kube-storage-version-migrator).
  • RBAC list matches the chart's current ClusterRole on main.
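
The Get+Update retry behaviour in the first highlight can be sketched roughly as below. This is a minimal illustration using only the Go standard library, not the controller's actual code: the `store` interface, the `object` type, `errConflict`, `errRetriesExhausted`, and the `flakyStore` fake are all hypothetical names introduced here. Only the limit of 3 attempts and the fixed 30s requeue come from the description above. How the real controller wires its sentinel into the reconcile result is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict stands in for an apiserver 409 Conflict (hypothetical name).
var errConflict = errors.New("conflict")

// errRetriesExhausted is a hypothetical sentinel returned when every
// attempt for one CR ended in a conflict.
var errRetriesExhausted = errors.New("conflict retries exhausted")

const (
	maxAttempts  = 3                // per-CR conflict retry limit (from the PR text)
	requeueAfter = 30 * time.Second // fixed delay on the sentinel path (from the PR text)
)

// object is a stand-in for a custom resource (hypothetical type).
type object struct {
	Name            string
	ResourceVersion int
}

// store is a hypothetical abstraction over the Kubernetes client.
type store interface {
	Get(name string) (object, error)
	Update(obj object) error
}

// migrateOne reads a CR and writes it back unchanged so the apiserver
// re-encodes it at the current storage version. A conflict triggers a
// fresh Get and another Update, up to maxAttempts in total.
func migrateOne(s store, name string) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		obj, err := s.Get(name)
		if err != nil {
			return err
		}
		err = s.Update(obj)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err // non-conflict errors are not retried here
		}
	}
	return errRetriesExhausted
}

// requeueDelay maps the sentinel to a fixed delay rather than a backoff.
func requeueDelay(err error) time.Duration {
	if errors.Is(err, errRetriesExhausted) {
		return requeueAfter
	}
	return 0
}

// flakyStore is a test fake whose first conflictsLeft Updates fail.
type flakyStore struct {
	conflictsLeft int
	updates       int
}

func (f *flakyStore) Get(name string) (object, error) {
	return object{Name: name, ResourceVersion: f.updates}, nil
}

func (f *flakyStore) Update(obj object) error {
	f.updates++
	if f.conflictsLeft > 0 {
		f.conflictsLeft--
		return errConflict
	}
	return nil
}

func main() {
	ok := &flakyStore{conflictsLeft: 2}
	fmt.Println(migrateOne(ok, "demo"), ok.updates)

	stuck := &flakyStore{conflictsLeft: 5}
	err := migrateOne(stuck, "demo")
	fmt.Println(err, requeueDelay(err))
}
```

In this sketch a CR that keeps conflicting is simply revisited after the fixed delay, which is safe under the upstream semantics noted above: each conflict implies some other writer's Update already succeeded.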

Upgrade-guide walkthrough:

  • Step-by-step reproducible verification on a local kind cluster, ~30 minutes wall-clock.
  • Replays the full sequence: install pre-graduation release, create CRs at v1alpha1, upgrade to multi-version release with migrator OFF, re-apply CRs at v1beta1, confirm storedVersions is stuck at [v1alpha1, v1beta1], enable the migrator, confirm convergence to [v1beta1] on all 12 CRDs.
  • Verifies zero-downtime upgrade and operator event emission alongside the storedVersions trim.
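
The storedVersions states the walkthrough checks for, and why they matter for a later version removal, can be expressed as two small predicates. This is an illustrative sketch only. The function names are hypothetical, and the slices stand in for a CRD's `status.storedVersions` field. The second predicate reflects the apiserver rule that a version still listed in `status.storedVersions` cannot be dropped from the CRD's `spec.versions`.

```go
package main

import "fmt"

// converged reports whether status.storedVersions lists only the
// current storage version, i.e. the end state the walkthrough expects.
func converged(storedVersions []string, storageVersion string) bool {
	return len(storedVersions) == 1 && storedVersions[0] == storageVersion
}

// canRemoveVersion reports whether a version could be dropped from the
// CRD: it must no longer appear in status.storedVersions.
func canRemoveVersion(storedVersions []string, version string) bool {
	for _, v := range storedVersions {
		if v == version {
			return false
		}
	}
	return true
}

func main() {
	// State with the migrator off, after re-applying CRs at v1beta1.
	before := []string{"v1alpha1", "v1beta1"}
	// State after the migrator has run.
	after := []string{"v1beta1"}

	fmt.Println(converged(before, "v1beta1"), canRemoveVersion(before, "v1alpha1"))
	fmt.Println(converged(after, "v1beta1"), canRemoveVersion(after, "v1alpha1"))
}
```

The first state is the one that would trip the skip-a-version trap: a release that removes v1alpha1 cannot be applied while v1alpha1 is still recorded as a stored version.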

Low level

| File | Change |
| --- | --- |
| docs/operator/storage-version-migration.md | New: reference docs |
| docs/operator/upgrade-guide/README.md | New: kind-cluster walkthrough |
| docs/operator/upgrade-guide/crs-v1alpha1.yaml | New: fixture for the 12 graduated CRDs at v1alpha1 |
| docs/operator/upgrade-guide/crs-v1beta1.yaml | New: same kinds at v1beta1 |

Type of change

  • Bug fix
  • New feature
  • Breaking change
  • Refactoring
  • Documentation
  • Other

Test plan

  • Reference doc content cross-checked against the on-main controller code at cmd/thv-operator/controllers/storageversionmigrator_controller.go — mechanism, retry semantics, error events, RBAC, label contract all match.
  • Env-var name in doc matches TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR (PR-A's rename).
  • CI test name in doc matches TestStorageVersionRootMarkerCoverage (PR-B's rename).
  • Chart-value default in doc (false) matches values.yaml on main.
  • Recommended pre-merge: a reviewer with a local kind environment runs the upgrade-guide walkthrough end-to-end. The walkthrough is the primary validation artifact for the feature.
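
For readers unfamiliar with the label contract that TestStorageVersionRootMarkerCoverage guards, the rule itself can be sketched as a predicate over a type's doc comment. This is not the actual CI test, and how that test locates root types is not described in this PR. The sketch assumes, purely for illustration, that a storage-version root type is one whose doc comment carries the `+kubebuilder:storageversion` marker; `violatesContract` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Assumed way of recognising a storage-version root type.
	storageVersionMarker = "+kubebuilder:storageversion"
	// Marker required by the label contract (taken from the PR text).
	migrateLabelMarker = "+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true"
)

// violatesContract reports whether a doc comment marks a storage-version
// root type but lacks the auto-migrate label marker.
func violatesContract(docComment string) bool {
	isStorageRoot := strings.Contains(docComment, storageVersionMarker)
	hasLabel := strings.Contains(docComment, migrateLabelMarker)
	return isStorageRoot && !hasLabel
}

func main() {
	good := "// +kubebuilder:storageversion\n// " + migrateLabelMarker
	bad := "// +kubebuilder:storageversion"

	fmt.Println(violatesContract(good))
	fmt.Println(violatesContract(bad))
}
```

Because no exclude marker exists, a check of this shape has no opt-out branch: every type recognised as a storage-version root either carries the label marker or fails.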

Why a separate PR

PR #5418 deliberately shipped only the chart surface so the reviewer could focus on the helm bits without the doc volume. This PR is the docs companion. They could have been combined; splitting was a reviewer-ergonomics choice.

Follow-ups for #4969 closure

After this PR merges, only deferred follow-ups remain:

  • Operator-code default flip in app/app.go (paired with a future chart-default flip).
  • Chart-conditional RBAC gating.
  • CRD-level MigrationStuck Condition + Warning event.
  • Full Prometheus metrics suite.

None of these are blocking the user-facing rollout. #4969 can close once this lands.

Generated with Claude Code

End-to-end user documentation for the StorageVersionMigrator
controller now that the chart surface (PR #5418) is in main.

Reference doc — docs/operator/storage-version-migration.md:
- Describes the actual shipped mechanism: plain Get+Update on the
  main resource, per-CR conflict retry (max 3), RequeueAfter
  sentinel on the conflict path.
- Admission-policy compatibility section: webhooks fire on every
  Update before the bytes-equality elision check, so policies
  (Kyverno/Gatekeeper/OPA) that reject same-spec round-trip
  Updates will prevent the migrator from converging.
- ⚠ Skip-a-version upgrade trap section: clusters that bypass an
  intermediate release that runs the migrator will hit a helm
  upgrade failure at the version-removal release;
  kube-storage-version-migrator documented as the recovery path.
- Label contract reflects the no-escape-hatch rule from PR-B —
  every storage-version root type must carry the migrate marker.
- RBAC list matches what's actually on main.

Upgrade-guide walkthrough — docs/operator/upgrade-guide/:
- Reproducible kind-cluster end-to-end test of the v1alpha1→v1beta1
  graduation, verifying storedVersions converges to [v1beta1] on
  all 12 graduated CRDs after enabling the migrator.
- CR fixtures for v1alpha1 and v1beta1 of all 12 graduated kinds.

Supersedes #5011, which carried earlier-draft versions that
described the pre-review SSA-on-/status mechanism, the removed
exclude marker, and the old env-var name.

Part of #4969.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.84%. Comparing base (a785995) to head (d1e823b).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5451      +/-   ##
==========================================
- Coverage   68.85%   68.84%   -0.02%     
==========================================
  Files         634      634              
  Lines       64437    64439       +2     
==========================================
- Hits        44371    44364       -7     
- Misses      16789    16794       +5     
- Partials     3277     3281       +4     

Labels

size/L Large PR: 600-999 lines changed
