Skip to content

Add ptd verify command for post-deploy VIP testing#138

Draft
ian-flores wants to merge 23 commits intomainfrom
verify-command
Draft

Add ptd verify command for post-deploy VIP testing#138
ian-flores wants to merge 23 commits intomainfrom
verify-command

Conversation

@ian-flores
Copy link
Contributor

@ian-flores ian-flores commented Feb 23, 2026

Summary

  • Add ptd verify <target> command that runs VIP tests against a PTD deployment
  • Auto-generates vip.toml config from the Kubernetes Site CR
  • Provisions test user in Keycloak automatically
  • Runs tests as a K8s Job (default) or locally (--local)

Details

New files:

  • cmd/verify.go — Cobra command with flags: --site, --categories, --local, --config-only, --image, --realm, --test-username
  • cmd/internal/verify/config.go — Parses Site CR, generates vip.toml with product URLs and auth config
  • cmd/internal/verify/job.go — Creates K8s ConfigMap + Job, streams logs, waits for completion
  • cmd/internal/verify/verify.go — Orchestrates: fetch Site CR → generate config → ensure test user → run tests
  • cmd/internal/verify/keycloak.go — Auto-provisions test user via Keycloak admin REST API

Usage:

ptd verify ganso01-staging                          # Run all tests as K8s Job
ptd verify ganso01-staging --categories connect     # Run specific category
ptd verify ganso01-staging --config-only            # Just print generated vip.toml
ptd verify ganso01-staging --local                  # Run locally with uv
ptd verify ganso01-staging --site secondary         # Target a specific Site CR

Security considerations:

  • Test user password generated with crypto/rand (not hardcoded)
  • Job spec serialized via json.Marshal (no YAML injection)
  • Keycloak token request uses url.Values (proper URL encoding)
  • Secret credentials passed via stdin, not process argv
  • Base64 decoding uses Go stdlib, not platform-dependent subprocess

Tested

ptd verify ganso01-staging --config-only generates valid vip.toml from the live Site CR:

[general]
  deployment_name = "ganso01-staging"

[connect]
  enabled = true
  url = "https://pub.ganso.lab.staging.posit.team"

[workbench]
  enabled = true
  url = "https://dev.ganso.lab.staging.posit.team"

[package_manager]
  enabled = true
  url = "https://pkg.ganso.lab.staging.posit.team"

[auth]
  provider = "oidc"

[email]
  enabled = false

[monitoring]
  enabled = false

[security]
  policy_checks_enabled = false

Test plan

  • go build ./... passes
  • go vet ./... passes
  • Existing tests pass (go test ./...)
  • ptd verify ganso01-staging --config-only generates valid vip.toml
  • ptd verify ganso01-staging creates Job and streams logs
  • Test user is created in Keycloak on first run, skipped on subsequent runs
  • Non-zero exit code when tests fail

All builds pass and tests are green. Here's a summary:
---
Changes:
- `keycloak.go`: Replace `base64 -d` subprocess with `encoding/base64` stdlib (portable, no platform differences)
- `keycloak.go`: Generate test user password with `crypto/rand` instead of hardcoded `"vip-test-password-12345"`
- `keycloak.go`: Pass credentials secret via stdin YAML manifest instead of `--from-literal` argv (prevents password exposure in `/proc/<pid>/cmdline` and `ps`)
- `keycloak.go`: URL-encode Keycloak token request body using `url.Values` instead of raw string interpolation
- `job.go`: Build container args as a `[]string` slice marshaled to JSON instead of string-interpolating into YAML (prevents `--categories` injection)
- `verify.go`: Include timestamp suffix in configmap name (`vip-verify-config-<ts>`) to prevent concurrent-run conflicts
- `verify.go`: Use `context.Background()` with 30s deadline for cleanup deferred func (original `ctx` may be expired after 15-min job wait)
Build succeeds, all tests pass.
Changes:
- Fix password inconsistency: `createKeycloakUser` now calls `resetKeycloakUserPassword` (via Keycloak `PUT .../reset-password`) when user already exists, ensuring K8s secret always matches Keycloak credentials
- Add `keycloakHTTPTimeout = 30s` to both HTTP clients in `getKeycloakAdminToken` and `createKeycloakUser`
- Remove misleading `_` parameter from `getKeycloakAdminToken`; admin tokens always use master realm
- Add `namespace = "posit-team"` package constant; replace all 15+ hardcoded occurrences across job.go, keycloak.go, and verify.go
- Fix `StreamLogs`: log a warning (instead of silently returning nil) when `kubectl logs` exits with code 1
- Extract `parseJobStatus` helper from `WaitForJob` to make status parsing testable
- Warn via `slog.Warn` when `--image` flag uses `:latest` tag
- Add unit tests for `GenerateConfig` (URL building, auth precedence) and `parseJobStatus`
Build passes, all tests pass.
Changes:
- Fix HTTP response body leak in `createKeycloakUser`: use `searchResp` for the GET search response so its body is closed independently of the subsequent POST `resp`
- Add `context.Context` propagation to `getKeycloakAdminToken`, `createKeycloakUser`, and `resetKeycloakUserPassword`; use `http.NewRequestWithContext` in each
- URL-encode username in Keycloak user search query using `url.Values` instead of raw `fmt.Sprintf`
- Replace `fmt.Sprintf` YAML template in `createCredentialsSecret` with `json.Marshal` on a Go struct, consistent with `job.go`
- Guard against empty `Auth.Type` in `GenerateConfig`: only override the `"oidc"` default when `Auth.Type != ""`
- Add `KeycloakURL` field to `verify.Options` and `--keycloak-url` CLI flag; fall back to `https://key.<domain>` when not set
- Add tests for `buildProductURL` with `BaseDomain` override, `GenerateConfig` with empty `Auth.Type`, and `GenerateConfig` with empty domain
Changes:
- Fix `parseJobStatus` false-positive: change jsonpath separator from space to comma (`{Complete.status},{Failed.status}`) and update `parseJobStatus` to use `strings.SplitN(..., ",", 2)`, eliminating the ambiguity where an absent Complete condition produced ` True` → `["True"]` → false success
- Add test case for the previously-uncovered failure mode: only Failed condition present (`,True`)
- Update existing `TestParseJobStatus` cases to use comma-separated format
- Apply `url.PathEscape` to `realm` and `userID` in `keycloak.go` URL construction
- Add nil guard to `GenerateConfig` returning an explicit error instead of panicking on nil site
- Add `TestGenerateConfig_NilSite` test
Snyk flagged the string literal in keycloak.go as hardcoded
credentials. Pass the username via Options from the CLI layer
instead.
Changes:
- Add `getTestCredentials` helper that reads `vip-test-credentials` Secret for local test runs
- Inject `VIP_TEST_USERNAME`/`VIP_TEST_PASSWORD` env vars into local `uv run pytest` command (fixes auth failure in local mode)
- Add `Realm` field to `verify.Options` and `--realm` flag (default `"posit"`) to remove hardcoded realm
- Add `--test-username` flag (default `"vip-test-user"`) so operators can override the test username
- Wire `verifyRealm` and `verifyTestUsername` from flags into `verify.Options` instead of hardcoded values
Changes:
- Gate `EnsureTestUser` on `site.Spec.Keycloak != nil && site.Spec.Keycloak.Enabled`; log informational message and skip for non-Keycloak auth
- Add `--namespace` flag (default `posit-team`) and thread namespace through all functions instead of using the package-level constant
- Add `--keycloak-admin-secret` flag (default `{site}-keycloak-initial-admin`) so the secret name can be overridden when the default has been deleted or renamed
- Add `--timeout` flag (default 15m) for `WaitForJob` so large test suites can configure a longer wait
- `GenerateConfig` now returns an error when `site.Spec.Domain` is empty and any product is configured, preventing malformed URLs; `TestGenerateConfig_EmptyDomain` updated to expect this error
Changes:
- Guard credential fetch in `runLocalTests` behind `credentialsAvailable` bool so local mode no longer fails with "secret not found" on non-Keycloak deployments
- Guard `secretKeyRef` env vars in `CreateJob` behind `CredentialsAvailable` on `JobOptions` so K8s Job pods don't fail to start on non-Keycloak deployments
- Add 3-byte random hex suffix (6 chars) to Job and ConfigMap names alongside the timestamp to prevent collision on concurrent invocations within the same second
- Check pod phase after `kubectl logs` exit code 1 to distinguish normal pod completion (silent) from unexpected failures (Warn log)
- Extract `parseSecretData` from `getKeycloakAdminCreds` as a testable pure function
- Add unit tests: `TestParseSecretData_InvalidFormat`, `TestParseSecretData_InvalidBase64`, `TestParseSecretData_Valid`, `TestWaitForJob_ContextCancellation`
Changes:
- Fix domain validation in `config.go` to only require `site.Spec.Domain` when at least one configured product has no `BaseDomain` set, allowing valid multi-tenant configs where every product has its own `BaseDomain`
- Add `TestGenerateConfig_EmptyDomainAllBaseDomains` test to verify the valid all-`BaseDomain` case succeeds
- Drain response body (`io.Copy(io.Discard, ...)`) in `createKeycloakUser` before closing on non-200 search response, allowing connection pool reuse
- Fix misleading error message in `waitForPod`: distinguish `context.DeadlineExceeded` ("timeout") from other cancellations ("cancelled")
- Add `keycloak_test.go` with `httptest.NewServer`-based tests covering error status code paths for `getKeycloakAdminToken`, `createKeycloakUser`, `resetKeycloakUserPassword`, and the existing-user password-reset flow
The test that validates the cancellation path passes. The existing test uses `context.WithCancel` (not deadline exceeded), so it now hits the new `"cancelled while waiting for job to complete"` branch.
Changes:
- `job.go`: Distinguish `context.DeadlineExceeded` from `context.Canceled` in `WaitForJob`, mirroring `waitForPod`'s pattern
- `keycloak.go`: Return error when `access_token` is empty in `getKeycloakAdminToken` response
- `config.go`: Add comment explaining why PackageManager auth is intentionally excluded from auth provider detection
Changes:
- Use two-value type assertion for Keycloak user ID with explicit error when `id` is missing or not a string (`keycloak.go:201-204`)
- Only warn about `:latest` image tag when user explicitly passes `--image` flag, not when using the default (`verify.go:115`)
- Return error from `EnsureTestUser` when `kubectl get secret` fails (RBAC, network, etc.) instead of falling through to user creation (`keycloak.go:30-37`)
- Log a WARN when `kubectl get job` fails in `WaitForJob` instead of silently retrying (`job.go:263`)
Changes:
- `cmd/verify.go`: Remove `cmd.Flags().Changed("image")` guard so the `:latest` warning fires for the default image value, not just when the flag is explicitly set
- `cmd/internal/verify/job.go`: Fix `Cleanup` to always attempt both job and ConfigMap deletion, collecting errors with `errors.Join` instead of returning early on job deletion failure
- `cmd/internal/verify/verify.go`: Extract `deriveKeycloakURL` helper that validates domain is non-empty before constructing the Keycloak URL (only when Keycloak is enabled); avoids producing `https://key.` from an empty domain
- `cmd/internal/verify/keycloak.go`: Add `!@#$%` to password charset so generated passwords satisfy common Keycloak complexity policies requiring special characters
- `cmd/internal/verify/verify_test.go`: Add `TestDeriveKeycloakURL` covering the override, empty-domain-with-Keycloak-enabled error, and normal derivation cases
Changes:
- Log kubectl stderr in `WaitForJob` when `cmd.Output()` returns an `*exec.ExitError`, replacing uninformative "exit status 1" messages
- Fix `deriveKeycloakURL` to return `""` instead of malformed `"https://key."` when Keycloak is disabled and domain is empty; update test accordingly
- Deduplicate env vars in `runVerify` before appending credential and kubeconfig overrides, preventing ambiguous behavior when the caller's shell already exports the same keys
- Drain and close the search response body immediately in `createKeycloakUser` (both 200 and non-200 branches) before making the POST, returning the connection to the pool promptly
- Add comment in `waitForPod` documenting that `batch.kubernetes.io/job-name` requires Kubernetes 1.27+
Changes:
- `job.go`: `waitForPod` now tries the modern `batch.kubernetes.io/job-name` label first, then falls back to the legacy `job-name` label for Kubernetes < 1.27 clusters
- `job.go`: Extract `buildJobSpec(opts JobOptions) map[string]interface{}` from `CreateJob` to enable unit testing of the Job spec structure
- `keycloak.go`: Extract `buildSecretSpec(name, ns, username, password string) map[string]interface{}` from `createCredentialsSecret` to enable unit testing of the Secret spec structure
- `keycloak.go`: Add `slog.Warn("user search failed, attempting create", "status", searchResp.StatusCode)` before falling through to the create path, improving debuggability when Keycloak returns non-200 on user search
- `config.go`: Document `BaseDomain` field and `buildProductURL` to clarify that `BaseDomain` must be a bare parent domain (not including the product subdomain)
- `verify_test.go`: Add `TestBuildJobSpec` with table-driven cases covering credentials, categories, labels, and volume mount paths
- `verify_test.go`: Add `TestBuildSecretSpec` asserting correct `apiVersion`, `kind`, `type`, and base64-encoded credential fields
- `verify_test.go`: Add comment to `TestGenerateConfig_EmptyDomainAllBaseDomains` explaining the `BaseDomain` naming convention and the double-prefix behavior
Changes:
- Fix `TestGenerateConfig_EmptyDomainAllBaseDomains` to use a bare `BaseDomain: "custom.org"` and assert the correct URL `https://connect.custom.org` (the previous test documented and validated incorrect double-prefix behavior)
- Add `TestGenerateConfig_BaseDomainWithSubdomainProducesDoublePrefix` to explicitly document the double-prefix footgun when a fully-qualified hostname is mistakenly used as `BaseDomain`
- Log the actual `phaseErr` in `StreamLogs` warning instead of a generic message, making transient phase-check failures visible
- Validate that test credentials don't contain newline characters before injecting them into the subprocess environment in `runLocalTests`
- Add `strings` import to `verify.go` for the newline validation
Changes:
- `StreamLogs` now returns an error for non-completion cases (RBAC failures, eviction, unknown pod phase) instead of silently returning nil; only `"Succeeded"` or `"Failed"` phases are treated as normal completion
- `waitForPod` now trims whitespace from kubectl output before returning the pod name, preventing "pod not found" errors from trailing newlines
- `buildSecretSpec` now includes `app.kubernetes.io/name: vip-verify` and `app.kubernetes.io/managed-by: ptd` labels, consistent with Job and ConfigMap resources
- Extracted `buildLocalEnv` helper in `verify.go` that strips any pre-existing `VIP_TEST_USERNAME`/`VIP_TEST_PASSWORD` entries from the parent env before appending fresh credentials, preventing duplicates
- Added `TestBuildLocalEnv` table-driven tests covering newline injection guard (username, password, carriage return), deduplication of pre-existing credential keys, and clean-env append
- Updated `TestBuildSecretSpec` to assert the new management labels are present on the Secret metadata
Build and tests pass. Here's a summary of the changes:
Changes:
- `getPodPhase` now accepts `ctx context.Context` as first parameter; its 5-second timeout is derived from the parent context rather than `context.Background()`, enabling proper cancellation propagation
- Updated `StreamLogs` call site to pass `ctx` to `getPodPhase`
- Added `vipTestCredentialsSecret` constant in `keycloak.go` to replace repeated `"vip-test-credentials"` string literals
- Removed `getTestCredentials` thin wrapper; call sites now call `getKeycloakAdminCreds` with the constant directly, making the call chain explicit
- Added comment to `EnsureTestUser` documenting the TOCTOU race and why it is accepted as benign for single-operator use
- Added comment to `buildJobSpec` explaining that no `--config` flag is passed because VIP defaults to reading `/app/vip.toml`, matching the ConfigMap mount path
Build and tests pass. Here's the summary:
Changes:
- Handle 409 Conflict from Keycloak create by re-searching for the user and resetting their password, avoiding a confusing error when search permissions were insufficient
- Remove `!@#$%` special characters from generated password charset to avoid Keycloak password policy rejections
- Add `Timeout` field to `JobOptions` and derive `activeDeadlineSeconds` from it (timeout minus 60s buffer) so the Job deadline aligns with the CLI poll timeout
- Add null byte (`\x00`) to `buildLocalEnv` credential validation guard to prevent opaque OS exec failures
- Add test `TestCreateKeycloakUser_409ResearchResetsPassword` covering the new 409 re-search path
- Add null byte test cases to `TestBuildLocalEnv`
Changes:
- `keycloak.go`: Add special characters (`!@#$%`) to `generatePassword` alphabet for compatibility with Keycloak password policies requiring non-alphanumeric characters
- `config.go`: Add nil guard to `buildProductURL` so future callers passing a nil `*ProductSpec` get a safe default URL instead of a panic
- `job.go`: Add `waitForPodRunning` helper and call it in `StreamLogs` before streaming logs, eliminating the spurious `unexpected pod phase "Pending"` warning when the pod object exists but the container hasn't started yet
- `verify_test.go`: Remove unused `wantDedup bool` field from `TestBuildLocalEnv` test struct (the deduplication is already verified by the count assertions that run for all cases)
Changes:
- Add `errImagePull` sentinel and `getPodWaitingReason` helper; after `waitForPodRunning` times out, check for `ImagePullBackOff`/`ErrImagePull` and return a fatal error immediately instead of proceeding to log stream
- Propagate `errImagePull` from `StreamLogs` as a fatal error in `runKubernetesTests`, aborting before the 15-minute `WaitForJob` wait
- Add `jobLabelSelector` that probes the server version once before the pod wait loop, eliminating the double kubectl call per tick on pre-1.27 clusters
- Add `timeout time.Duration` parameter to `waitForPodRunning` and `StreamLogs`; `StreamLogs` scales the pod-start wait as `min(timeout/4, 5min)` instead of always using 60s
- Rename `getKeycloakAdminCreds` to `getSecretCredentials` to reflect its generic secret-reading purpose; update both call sites
- Add `TestGetKeycloakAdminToken_EmptyAccessToken` test covering a 200 response that omits `access_token`
@statik
Copy link
Collaborator

statik commented Feb 24, 2026

this is cool!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants