Implement POD autoscaling and ConfigMaps for API definitions #26
Open
buger wants to merge 135 commits into
This commit introduces comprehensive improvements to the performance testing infrastructure:
## POD Autoscaling (HPA) Enhancements
- Enable HPA by default with increased replica limits (2-12 replicas)
- Improved autoscaling configuration for better performance testing
- Enhanced load testing patterns that trigger scaling appropriately
## ConfigMaps for API Definitions
- Replace Tyk Operator with ConfigMaps for API definition management
- Conditional deployment logic: operator disabled when ConfigMaps enabled
- File-based API and policy definitions mounted via Kubernetes ConfigMaps
- Improved reliability and simpler deployment without operator dependency
## k6 Load Testing Improvements
- Default gradual traffic scaling pattern (baseline → 2x scale-up → scale-down)
- Backward compatibility with existing SCENARIO-based tests
- Enhanced performance monitoring with response validation and thresholds
- Autoscaling-friendly traffic patterns with proper timing for HPA response
## Key Features
- **Smart scenario selection**: Custom scenarios when SCENARIO provided, scaling pattern as default
- **Conditional operator**: Tyk operator only deployed when not using ConfigMaps
- **Volume mounts**: API definitions at /opt/tyk-gateway/apps, policies at /opt/tyk-gateway/policies
- **Environment configuration**: Proper Tyk gateway configuration for file-based operation
- **Variable flow**: Complete variable propagation from root to leaf modules
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add 'autoscaling-gradual' scenario to scenarios.js with 3-phase pattern - Set new scenario as default executor instead of constant-arrival-rate - Revert test script to original simple SCENARIO-based approach - Maintain backward compatibility with all existing scenarios - Update default test duration to 30 minutes for full scaling cycle This maintains the original architecture while making gradual scaling the default behavior through proper scenario selection. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Copy workflow files for Terraform state management: - terraform_reinit.yml: Reinitialize Terraform state - terraform_unlock.yml: Unlock single Terraform state - terraform_unlock_all.yml: Unlock all Terraform states - clear_terraform_state.yml: Clear Terraform state (already present) These workflows provide essential maintenance operations for managing Terraform state in CI/CD environments. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Set use_config_maps_for_apis = true as default in all variable definitions - Add explicit setting in deployments/main.tfvars.example - Users can still opt for operator by setting use_config_maps_for_apis = false This makes the more reliable ConfigMap approach the default while maintaining backward compatibility with the operator-based approach. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Add step to display first 200 lines of Tyk Gateway pod logs - Helps diagnose startup issues and API mounting problems - Runs after deployment but before tests start 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Change default tests_executor from constant-arrival-rate to autoscaling-gradual - Update description to include the new scenario option - Ensures tests properly exercise autoscaling behavior by default 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Add step to show last 200 lines of Tyk Gateway logs after tests complete - Helps diagnose any issues that occurred during load testing - Complements the pre-test logs for full visibility 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Using indexed set blocks for extraEnvs created sparse arrays with null entries, causing Kubernetes to reject deployments with "env[63].name: Required value" error. Solution (from BigBrain analysis): - Moved all extraEnvs to locals as a single list - Use yamlencode with values block instead of indexed set blocks - Ensures every env entry has both name and value properties - Eliminates sparse array issues that Helm creates with indexed writes This follows Helm best practices for passing structured data and prevents null placeholders in the final rendered container env list. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
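The sparse-array failure mode has a close analogue in JavaScript. The sketch below is an illustration only: the actual fix lives in Terraform and Helm, and the variable names and env entries here are hypothetical. Writing entries by index leaves holes that serialize as null, while building one dense list guarantees every entry carries both a name and a value.

```javascript
// Indexed writes: index 1 is never assigned, so the array has a hole.
const indexedEnvs = [];
indexedEnvs[0] = { name: 'EXAMPLE_A', value: '1' };
indexedEnvs[2] = { name: 'EXAMPLE_C', value: '3' };

// JSON serialization renders the hole as null, the kind of placeholder
// that a validator rejects with an error such as "name: Required value".
const renderedIndexed = JSON.stringify(indexedEnvs);

// Single list: entries are written in order, so no holes can appear.
const denseEnvs = [
  { name: 'EXAMPLE_A', value: '1' },
  { name: 'EXAMPLE_C', value: '3' },
];
const renderedDense = JSON.stringify(denseEnvs);

console.log(renderedIndexed);
console.log(renderedDense);
```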
Problem: The autoscaling-gradual scenario was incorrectly structured as an object with nested sub-scenarios (baseline_phase, scale_up_phase, scale_down_phase), which k6 doesn't recognize as a valid scenario format. This caused tests to not run at all - k6 CRD was created but never executed.
Solution: Converted to a single ramping-arrival-rate scenario with all stages combined sequentially:
- Baseline phase (0-5m): Ramp to and hold at 20k RPS
- Scale up phase (5m-20m): Gradually increase from 20k to 40k RPS
- Scale down phase (20m-30m): Gradually decrease back to 20k RPS
This follows the proper k6 scenario structure and ensures tests execute. Confirmed via GitHub Actions logs - test CRD completed in 1s without running.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
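A minimal sketch of what a single ramping-arrival-rate scenario with those three phases could look like. The split of the baseline into a 1-minute ramp plus a 4-minute hold, the VU counts, and the `totalMinutes` helper are assumptions for illustration, not values taken from the repository's scenarios.js.

```javascript
// One flat scenario object; k6 expects `stages` directly on the scenario,
// not nested sub-scenarios.
const autoscalingGradual = {
  executor: 'ramping-arrival-rate',
  startRate: 0,
  timeUnit: '1s',
  preAllocatedVUs: 500, // assumed value
  maxVUs: 5000,         // assumed value
  stages: [
    { target: 20000, duration: '1m' },  // ramp to baseline (assumed split)
    { target: 20000, duration: '4m' },  // hold baseline until 5m
    { target: 40000, duration: '15m' }, // scale up, 5m-20m
    { target: 20000, duration: '10m' }, // scale down, 20m-30m
  ],
};

// In a k6 script this object would be referenced from the exported options:
//   export const options = { scenarios: { 'autoscaling-gradual': autoscalingGradual } };

// Hypothetical helper: sums stage durations that are expressed in minutes.
function totalMinutes(stages) {
  return stages.reduce((sum, s) => sum + parseInt(s.duration, 10), 0);
}

console.log(totalMinutes(autoscalingGradual.stages)); // 30
```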
Problem: API definitions were pointing to non-existent service
`upstream.upstream.svc.cluster.local:8080`, causing all requests
to fail with DNS lookup errors.
Solution: Updated target URL to match the actual deployed fortio services:
`fortio-${i % host_count}.tyk-upstream.svc:8080`
This matches the pattern used in the Operator version and ensures:
- APIs point to the correct fortio services in tyk-upstream namespace
- Load is distributed across multiple fortio instances using modulo
- Performance tests can actually reach the backend services
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
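The modulo distribution can be sketched as a small function. The actual definitions are generated in Terraform; `upstreamTarget` and its parameter names are hypothetical and only mirror the `fortio-${i % host_count}` pattern quoted above.

```javascript
// Maps API index i onto one of hostCount fortio services in the
// tyk-upstream namespace, cycling through them with modulo.
function upstreamTarget(i, hostCount) {
  return `fortio-${i % hostCount}.tyk-upstream.svc:8080`;
}

// With 3 upstream hosts, APIs 0..4 land on hosts 0, 1, 2, 0, 1.
for (let i = 0; i < 5; i++) {
  console.log(i, upstreamTarget(i, 3));
}
```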
Changes to support HPA autoscaling visibility: 1. Increase services_nodes_count to 2 - provides CPU headroom for HPA to work (single node at 100% CPU prevents HPA from functioning) 2. Set test duration default to 30 minutes to match autoscaling-gradual scenario 3. Keep replica_count at 2 with HPA min=2, max=12 for proper scaling This configuration ensures: - HPA has CPU capacity to scale pods up and down - Test runs for full 30-minute autoscaling cycle - Grafana will show HPA responding to load changes 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
With resources_limits_cpu=0, pod CPU percentages use undefined denominators, making metrics confusing (4% pod vs 98% node). Setting explicit limits: - CPU request: 1 vCPU, limit: 2 vCPUs per pod - Memory request: 1Gi, limit: 2Gi per pod This ensures: - Pod CPU % = actual usage / 2 vCPUs (clear metric) - HPA can make informed scaling decisions - Node capacity planning is predictable With c2-standard-4 nodes (4 vCPUs), each node can handle 2 pods at max CPU. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
The workflows were not passing services_nodes_count variable when creating clusters, causing them to use the default value of 1 instead of the configured value of 2 from main.tfvars.example. This prevented HPA from working properly because a single node at 100% CPU couldn't accommodate additional pods for scaling. Fixed by explicitly passing --var="services_nodes_count=2" to terraform apply for all cloud providers (GKE, AKS, EKS). 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Set CPU requests to 500m (was 0) to enable HPA percentage calculation - Set memory requests to 512Mi (was 0) for proper resource allocation - Set CPU limits to 2000m and memory limits to 2Gi - Reduce HPA CPU threshold from 80% to 60% for better demo visibility Without resource requests, HPA cannot calculate CPU utilization percentage, causing pods to remain stuck at minimum replicas despite high node CPU usage. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
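To see why requests matter, here is a rough sketch of the arithmetic the HPA applies for a CPU utilization target: utilization is usage divided by the request, and the desired replica count is the ceiling of current replicas times the ratio of current to target utilization. The function names are hypothetical, the usage figure is made up, and the sketch leaves out the controller's tolerance band and stabilization windows.

```javascript
// Utilization is measured against the CPU request. With a request of 0
// there is no denominator, so the value is undefined (null here).
function cpuUtilizationPct(usageMilli, requestMilli) {
  if (requestMilli <= 0) return null;
  return (usageMilli * 100) / requestMilli;
}

// desired = ceil(current * currentUtil / targetUtil), clamped to [min, max].
function desiredReplicas(current, utilPct, targetPct, min, max) {
  const raw = Math.ceil(current * (utilPct / targetPct));
  return Math.min(max, Math.max(min, raw));
}

const util = cpuUtilizationPct(750, 500); // 150% of a 500m request
console.log(util, desiredReplicas(2, util, 60, 2, 12));
console.log(cpuUtilizationPct(750, 0)); // null: no request, no percentage
```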
- Adjust HPA threshold to 70% (balanced between 60% and 80%) - Reduce base load from 20k to 15k req/s for more realistic testing - Scale load pattern from 15k → 35k req/s (was 20k → 40k) - Increase API routes from 1 to 10 (still using 1 policy/app) - Update autoscaling-gradual scenario with fixed 35k peak target Load pattern now: - Baseline: 15k req/s - Peak: 35k req/s (fixed value to ensure exact target) - Gradual scaling through 20k, 25k, 30k steps This provides more realistic load levels and clearer HPA scaling demonstration. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Maintains flexibility - if rate changes, the peak will scale proportionally. With rate=15000, this gives us exactly 34,950 ≈ 35k req/s at peak. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Keep it simple - rate * 2.33 works fine without rounding. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
With 3 nodes and HPA scaling from 2-12 pods, we can better demonstrate: - Initial distribution across 3 nodes - Pod scaling as load increases - More realistic production-like setup 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Updated services_nodes_count from varying values to 3 in: - gke/main.tfvars.example (was 2) - aks/main.tfvars.example (was 1) - eks/main.tfvars.example (was 1) This ensures consistency with the GitHub Actions workflow and provides better load distribution across nodes for HPA scaling demonstrations. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Added workflow inputs for optional node failure simulation - simulate_node_failure: boolean to enable/disable feature - node_failure_delay_minutes: configurable delay before termination - Implements cloud-specific node termination (Azure/AWS/GCP) - Runs as background process during test execution - Provides visibility into node termination and cluster recovery 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Updated GitHub Actions workflow to use 4 nodes - Updated all example configurations (GKE, AKS, EKS) - Provides better capacity for node failure simulation 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Added test_duration_minutes workflow input (default 30, max 360)
- Made autoscaling-gradual scenario duration-aware with proportional phases
- Adjusted deployment stabilization wait time (5-15 min based on duration)
- Scaled K6 setup timeout with test duration (10% of duration, min 300s)
- Supports tests from 30 minutes to 6 hours
Key changes:
- Baseline phase: ~17% of total duration
- Scale-up phase: ~50% of total duration
- Scale-down phase: ~33% of total duration
- Maintains same load profile (15k->35k->15k) regardless of duration
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
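One way the proportional phases and the scaled setup timeout might be computed, as a sketch. The rounding strategy (scale-down takes whatever remains so the phases always sum to the total) and the function names are assumptions; the 2.33 multiplier is the peak factor from the earlier commits.

```javascript
// Splits the run into ~17% baseline, ~50% scale-up, and the remainder
// (~33%) scale-down, keeping the same baseline -> peak -> baseline profile.
function gradualStages(totalMinutes, rate) {
  const baseline = Math.round(totalMinutes * 0.17);
  const scaleUp = Math.round(totalMinutes * 0.5);
  const scaleDown = totalMinutes - baseline - scaleUp;
  return [
    { target: rate, duration: `${baseline}m` },
    { target: rate * 2.33, duration: `${scaleUp}m` },
    { target: rate, duration: `${scaleDown}m` },
  ];
}

// Setup timeout: 10% of the test duration, but never below 300 seconds.
function setupTimeoutSeconds(totalMinutes) {
  return Math.max(300, Math.floor((totalMinutes * 60) / 10));
}

console.log(gradualStages(30, 15000).map((s) => s.duration)); // [ '5m', '15m', '10m' ]
console.log(setupTimeoutSeconds(30), setupTimeoutSeconds(360)); // 300 2160
```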
The node failure simulation was running but couldn't find gateway pods due to incorrect label selector. Fixed to use the correct selector: --selector=app=gateway-tyk-tyk-gateway This matches what's used in the 'Show Tyk Gateway logs' steps. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
The snapshot job was timing out because the timeout calculation was incorrect.
For a 30-minute test:
- Job waits 40 minutes (duration + buffer) before starting snapshot
- Previous timeout: (30 + 10) * 2 = 80 minutes total
- Job would time out before completing snapshot generation
Fixed to: duration + buffer + 20 minutes extra for snapshot generation
New timeout for 30-min test: 30 + 10 + 20 = 60 minutes
This gives enough time for the delay plus actual snapshot work.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added a new Node Count panel next to the Gateway HPA panel to track: - Number of nodes per gateway type (Tyk, Kong, Gravitee, Traefik) - Total cluster nodes - Will show node failures clearly (e.g., drop from 4 to 3 nodes) This complements the HPA panel which shows pod count. While pods get rescheduled quickly after node failure, the node count will show the actual infrastructure reduction. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Added a new 'Pod Disruption Events' panel that tracks: - Pending pods (yellow) - pods waiting to be scheduled - ContainerCreating (orange) - pods being initialized - Terminating (red) - pods being shut down - Failed pods (dark red) - pods that failed to start - Restarts (purple bars) - container restart events This panel will clearly show disruption when a node fails: - Spike in Terminating pods when node is killed - Spike in Pending/ContainerCreating as pods reschedule - Possible restarts if pods crash during migration Reorganized Horizontal Scaling section layout: - Pod Disruption Events (left) - shows scheduling disruptions - Gateway HPA (middle) - shows pod counts - Node Count (right) - shows infrastructure changes Now you'll visually see the chaos when node failure occurs! 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Fixed several issues with the metrics queries: 1. Node Count panel: - Added fallback query using kube_node_status_condition for better node tracking - Should now properly show node count changes (4 -> 3 when node fails) 2. Pod Disruption Events panel: - Removed 'OR on() vector(0)' which was causing all metrics to show total pod count - These queries will now only show actual disrupted pods (not all pods) - Added 'New Pods Created' metric to track pod rescheduling events The issue was that 'OR on() vector(0)' returns 0 when there's no data, but when combined with count(), it was returning the total count instead. Now the queries will properly show 0 when there are no pods in those states, and actual counts when disruption occurs. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Based on architect agent analysis, fixed critical issues: 1. Node Count Panel - Fixed regex pattern: - Was: .*tyk-np.* (didn't match GKE node names) - Now: .*-tyk-np-.* (matches gke-pt-us-east1-c-tyk-np-xxxxx) - Removed OR condition, using only kube_node_status_condition for accuracy - Applied same fix to all node pools (kong, gravitee, traefik) 2. Pod Disruption Events - Enhanced queries: - Terminating: Added > 0 filter to count only pods with deletion timestamp - New Pods: Changed from increase to rate * 120 for better visibility - Added Evicted metric to track pod evictions during node failure These fixes address why node count wasn't changing from 4→3 during node termination. The regex pattern was the key issue - it didn't match the actual GKE node naming convention. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Disable auto-repair for node pool before deletion - Use gcloud compute instances delete with --delete-disks=all flag - Run deletion in background for more abrupt failure - Add monitoring to track pod disruption impact - Show pod count on node before termination This creates a more realistic sudden node failure by preventing automatic recovery and ensuring complete VM deletion.
- Remove invalid --delete-disks=all flag - Force delete instance and wait for completion - Resize node pool down then up to control recovery timing - Better monitoring of node count and pod disruption - This creates true hard shutdown behavior with maximum impact
This reverts commit d943aae.
Feature/otel
Bumps tests_auth_key_count default from 100 to 10000 and introduces tests_auth_key_random_selection (default false) which decouples the Authorization token from the route index in the k6 script. Set it to true together with rate_limit_enabled=true to drive high-cardinality DRL bucket usage when reproducing the memorycache leak fixed in Tyk PR 8180. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip auth_enabled, rate_limit_enabled, and tests_auth_key_random_selection defaults from false to true so the existing Full Performance Test workflow (which doesn't expose tfvars inputs for these) drives the high-cardinality DRL bucket pattern needed to repro the Tyk PR 8180 memorycache leak. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes that together restore observability when a k6 segment crashes during setup: - Switch K6 CR cleanup from "post" to "pre". With "post" the operator deletes runner/initializer pods (and their logs) the moment the test ends, success or failure - so any setup() crash leaves no evidence to debug. "pre" still tidies up before the next test, but keeps this test's pods around long enough for the workflow's log-capture step to actually find something. - Tighten wait_for_k6_segment's "CR disappeared after being in 'started' -> success" heuristic. Run 25455997972 went from CR creation to disappearance in 32 seconds on a 60-minute test budget and the script reported success; the workflow turned green with zero load generated. Now require at least 70% of the segment budget to have elapsed before treating disappearance as completion, and on a too-fast disappearance dump runner / initializer / operator logs and fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
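The tightened heuristic boils down to a single threshold check. The sketch below transcribes that logic into JavaScript for illustration; the real wait_for_k6_segment lives in the workflow scripts, and the function and constant names here are hypothetical.

```javascript
// A CR that disappears is only treated as a completed segment if at least
// 70% of the segment's time budget has elapsed.
const MIN_ELAPSED_FRACTION = 0.7;

function classifyDisappearance(elapsedSeconds, budgetSeconds) {
  return elapsedSeconds >= budgetSeconds * MIN_ELAPSED_FRACTION
    ? 'completed'
    : 'failed-too-fast'; // caller should dump runner/initializer/operator logs
}

// The run described above: 32 seconds into a 60-minute budget.
console.log(classifyDisappearance(32, 3600));   // failed-too-fast
console.log(classifyDisappearance(3000, 3600)); // completed
```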
The previous implementations of generateKeys / createKeys / createApplications / createSubscriptions in the Tyk, Kong, and Gravitee modules issued requests sequentially against the gateway admin APIs and called fail() on the first non-2xx response. With the new tests_auth_key_count default of 10000 that meant a single transient timeout or 5xx during k6 setup() killed the entire test run before any load was generated (run 25455997972 - K6 CR went from creation to operator cleanup in 32 seconds).
Three changes per gateway:
- Parallelism via http.batch in groups of 50, since k6 setup() is single-VU and that is the only way to drive concurrency. Cuts 10k sequential POSTs down to ~200 round trips.
- Per-request retry loop with exponential backoff (4 attempts: initial + 3 retries at 100ms / 200ms / 400ms). A transient flake on one key no longer aborts the run.
- Soft failure tolerance: up to TOLERANCE_PCT (1%) of keys may fail after retries; the run continues with whatever keys did succeed. Above the threshold we still call fail() loudly.
Progress is logged every 1000 keys so the initializer/runner pod log shows forward motion in real time.
Also stop the wait_for_k6_segment finished-branch from blocking on a 10-minute kubectl wait --for=delete: with cleanup: pre the operator keeps the CR (and its pods) around on purpose, so the wait was just dead time at the end of every successful segment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
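The three mechanisms (grouping, retry with backoff, soft tolerance) can be sketched independently of k6. This is not the repository's auth.js: `chunk`, `backoffMs`, `createWithRetry`, and `withinTolerance` are hypothetical helpers, and the create and sleep functions are injected so the logic runs outside k6, where the real code would call http.batch and sleep.

```javascript
const BATCH_SIZE = 50;
const MAX_ATTEMPTS = 4;  // initial attempt + 3 retries
const TOLERANCE_PCT = 1; // share of keys allowed to fail after retries

// Splits items into groups of `size`; 10000 keys become 200 groups of 50.
function chunk(items, size) {
  const groups = [];
  for (let i = 0; i < items.length; i += size) groups.push(items.slice(i, i + size));
  return groups;
}

// Delay before retry number retryIndex (0-based): 100ms, 200ms, 400ms.
function backoffMs(retryIndex) {
  return 100 * 2 ** retryIndex;
}

// Tries createFn up to MAX_ATTEMPTS times, sleeping between attempts.
function createWithRetry(createFn, key, sleepFn) {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    if (createFn(key)) return true;
    if (attempt < MAX_ATTEMPTS - 1) sleepFn(backoffMs(attempt));
  }
  return false;
}

// True while the failure count stays at or below the tolerated percentage.
function withinTolerance(failed, total) {
  return failed * 100 <= total * TOLERANCE_PCT;
}

const keyIds = Array.from({ length: 10000 }, (_, i) => i);
console.log(chunk(keyIds, BATCH_SIZE).length); // 200
```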
Switch the default auth_type from authToken to JWT-HMAC and wire the Tyk configmap-path API definitions to honour it. The k6 setup() function now signs JWTs locally (HMAC-SHA256) instead of POSTing 10,000 keys to the Tyk dashboard - which means setup completes in seconds, can't be killed by a single transient dashboard 5xx, and needs no setupTimeout headroom. Three pieces: - Make generateJWTHMACKeys produce a unique sub per key (drop the "% 100" cycle). Tyk's JWT middleware uses jwt_identity_base_field to derive the session identity, so 10k unique subs map to 10k distinct Tyk sessions and therefore 10k distinct DRL rate-limit buckets - which is the high-cardinality scenario PR 8180 is about. - Mirror the JWT auth wiring from operator-api.tf into api-definitions.tf, so use_config_maps_for_apis=true (the default) also gets enable_jwt / jwt_signing_method / jwt_source / jwt_identity_base_field / jwt_policy_field_name / jwt_default_policies. JWT defaults are added via merge() only when auth.enabled and auth.type is JWT-HMAC or JWT-RSA, so the authToken path is untouched. - Add explicit _id and id fields to the policy JSON so file-based policy loading produces a deterministic policy ID for jwt_default_policies to reference. The batched authToken generators (Tyk/Kong/Gravitee) stay as-is - they're still useful for anyone running with auth_type=authToken, they're just no longer the default repro path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
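For orientation, a minimal sketch of local HS256 signing with a unique sub per key. It uses Node's built-in crypto module so it runs standalone; the k6 script would use k6's own crypto and encoding modules instead. The `user-${i}` sub format, the secret, and the function names are assumptions, and any additional claims the gateway may require (for example a policy claim) are left out.

```javascript
const crypto = require('crypto');

// Base64url without padding, as used by the JWT compact serialization.
function b64url(input) {
  return Buffer.from(input)
    .toString('base64')
    .replace(/=+$/, '')
    .replace(/\+/g, '-')
    .replace(/\//g, '_');
}

// Signs header.payload with HMAC-SHA256 and appends the signature.
function signJwtHs256(payload, secret) {
  const header = b64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }));
  const body = b64url(JSON.stringify(payload));
  const mac = crypto.createHmac('sha256', secret).update(`${header}.${body}`).digest();
  return `${header}.${body}.${b64url(mac)}`;
}

// One token per key index, each with its own sub (no "% 100" cycling),
// so every token maps to a distinct session identity.
function generateKeys(count, secret) {
  const keys = [];
  for (let i = 0; i < count; i++) {
    keys.push(signJwtHs256({ sub: `user-${i}` }, secret));
  }
  return keys;
}

const sampleKeys = generateKeys(3, 'example-secret');
console.log(sampleKeys[0].split('.').length); // 3
```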
Previously the OTel collector's metrics pipeline only logged received
metrics ("logging" exporter), so even though Tyk Gateway already had
TYK_GW_OPENTELEMETRY_METRICS_ENABLED wiring, nothing was queryable in
Grafana. Memory-leak regressions like Tyk PR 8180 are invisible to
the test infrastructure as a result.
Two changes:
- otel-collector.tf: add a prometheusremotewrite exporter pointing at
the same Prometheus k6 already writes to
(prometheus-server.dependencies.svc:80/api/v1/write), and switch
the metrics pipeline from logging -> prometheusremotewrite. Tyk
gateway runtime metrics (heap_inuse, heap_objects, goroutines,
gc_duration, etc.) now land next to k6_http_reqs_total in the same
Grafana datasource.
- vars.middleware.tf + main.tfvars.example: flip
open_telemetry_metrics_enabled and open_telemetry_runtime_metrics
defaults from false to true. The OTel collector deploys whenever
either traces or metrics is enabled, so this also brings the
collector up. Traces remain off by default - jaeger is heavy and
not needed for leak detection.
Follow-up commit will add a Grafana panel/row for memory & GC stats.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs in the recent observability + parallelism work surfaced during a follow-up review: - cleanup: "pre" is invalid. The k6-operator TestRun CRD declares +kubebuilder:validation:Enum=post for the Cleanup field (see api/v1alpha1/testrun_types.go in grafana/k6-operator), so the API server would reject any K6 CR with cleanup: "pre" outright, and our terraform apply would have failed before any test ran. Omit the field instead - the operator then does no cleanup, which is what we wanted: pods persist for post-mortem logs and terraform destroy tidies them up between runs. - batch:20 / batchPerHost:6 are k6's default ceilings on http.batch. Even though our auth.js generators issue 50-wide batches, the effective concurrency was capped at 6 against any single host (the gateway admin API). Lift the option-level ceilings to 50 so the parallelism we asked for in d44bf4f is what actually flies. JWT-HMAC default path doesn't need this, but it's still correct for anyone running with auth_type=authToken. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
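The effect of the two ceilings on a single-host batch can be sketched as a simple minimum. `batch` and `batchPerHost` are real k6 options with the defaults quoted above; the helper function and its name are hypothetical and simplify k6's scheduling to the single-host case discussed here.

```javascript
// k6 defaults, and the raised ceilings this commit sets.
const defaultOptions = { batch: 20, batchPerHost: 6 };
const raisedOptions = { batch: 50, batchPerHost: 50 };

// When every request in a batch targets the same host, the number of
// requests in flight is capped by the smallest of the three limits.
function effectiveSingleHostConcurrency(requested, opts) {
  return Math.min(requested, opts.batch, opts.batchPerHost);
}

console.log(effectiveSingleHostConcurrency(50, defaultOptions)); // 6
console.log(effectiveSingleHostConcurrency(50, raisedOptions));  // 50

// In the k6 script the raised values would sit in the exported options:
//   export const options = { batch: 50, batchPerHost: 50 };
```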
Adds a "Gateway Memory & GC (Tyk OTel runtime metrics)" row to the existing k6 dashboard with six panels designed to make memory leaks of the Tyk PR 8180 family visible at a glance: - Go memory used (heap + stack), split by go_memory_type. The "other" bucket is the heap; on a leak it climbs steadily under flat traffic. - Go GC heap goal. Rising goal alongside rising memory.used confirms the heap is genuinely growing (not allocator slack). - Goroutines (go_goroutine_count). Monotonic climb under steady traffic is the smoking gun for goroutine / timer leaks. - Allocation rate (rate(go_memory_allocated_bytes_total[5m])). Flat allocation rate while memory.used climbs is the leak fingerprint. - Container working set (cAdvisor container_memory_working_set_bytes). The kernel's view - matches what gets OOM-killed. - Pod restart count. Steps up = OOM kill. Climb-then-cliff with the working-set panel is the unmistakable end-stage of a leak run. Metric names verified against the actual Tyk OTel instrumentation in ../tyk: - runtime contrib v0.67.0 emits the new go.* names (go.memory.used, go.goroutine.count, go.memory.allocated, go.memory.gc.goal). The legacy process.runtime.go.* / runtime.go.* set is gated behind OTEL_GO_X_DEPRECATED_RUNTIME_METRICS and off by default, so we cannot use heap_inuse / heap_objects / gc.pause_ns. - After prometheusremotewrite normalization (dots->underscores, units appended, monotonic counters get _total) the Prometheus names are go_memory_used_bytes, go_memory_gc_goal_bytes, go_goroutine_count, go_memory_allocated_bytes_total. These are what the panels query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
terraform plan failed in run 25503541943 with:
Error: Inconsistent conditional result types
on api-definitions.tf line 64
The 'true' value includes object attribute "enable_jwt", which is
absent in the 'false' value.
Terraform's strict typechecker rejects ?: arms whose object attribute
sets differ - so "enable_jwt ? { jwt_signing_method = ... } : {}"
fails because the empty object on the false branch has no attributes.
merge() inherits the same constraint when its arguments are themselves
typed objects.
The standard fix is to keep the schema static and conditionalize the
values: enable_jwt is always present (either true or false), and the
jwt_* fields are always present too with empty defaults when JWT is
off. Tyk ignores jwt_signing_method / jwt_source / jwt_default_policies
when enable_jwt is false, so this is functionally equivalent to the
omitted-field form.
Verified locally with terraform init -backend=false + terraform
validate in both deployments/ and tests/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment-only follow-up to 24d9ca7. The wait function comments still referenced "cleanup: post" and "cleanup: pre" but the K6 CR manifest no longer sets cleanup at all (those values either don't exist - "pre" - or destroy log evidence - "post"). Updated the comments so a future reader doesn't get misled into thinking the script is reasoning about an operator-driven CR deletion that never happens in normal operation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first run with metrics enabled (run on commit 6e71926) showed "Goroutines" populated but "Go memory used", "Go GC heap goal", and "Allocation rate" all empty. Confirmed cause: the OTel collector helm chart 0.62.0 ships an older collector (~v0.78.0) where the prometheusremotewrite exporter does NOT add the unit suffix to metric names by default - that became opt-in via add_metric_suffixes later. The OTel instrument go.memory.used (unit "By") therefore lands in Prometheus as go_memory_used, not go_memory_used_bytes. go_goroutine_count works because its UCUM unit "{goroutine}" is dropped, leaving no suffix to mismatch. Switch the three failing panels to regex matches on __name__ so they work with both naming variants (current collector and any future bump that turns suffixes back on): go_memory_used_bytes -> {__name__=~"go_memory_used(_bytes)?"} go_memory_gc_goal_bytes -> {__name__=~"go_memory_gc_goal(_bytes)?"} go_memory_allocated_bytes_total -> {__name__=~"go_memory_allocated(_bytes)?_total"} Counters keep _total because that's a separate exporter convention and is added regardless of add_metric_suffixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
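A quick way to sanity-check the three patterns is to run them against both naming variants. Prometheus anchors label regexes at both ends, which the sketch imitates with explicit ^ and $; the `matchers` object and the sample names list are illustrative only.

```javascript
// The three __name__ patterns from this commit, anchored the way
// Prometheus anchors regex matchers.
const matchers = {
  used: /^go_memory_used(_bytes)?$/,
  gcGoal: /^go_memory_gc_goal(_bytes)?$/,
  allocated: /^go_memory_allocated(_bytes)?_total$/,
};

const sampleNames = [
  'go_memory_used',                  // older collector, no unit suffix
  'go_memory_used_bytes',            // collector with unit suffixes
  'go_memory_gc_goal',
  'go_memory_gc_goal_bytes',
  'go_memory_allocated_total',
  'go_memory_allocated_bytes_total',
  'go_goroutine_count',              // queried directly, matches none of these
];

for (const name of sampleNames) {
  const hits = Object.keys(matchers).filter((k) => matchers[k].test(name));
  console.log(name, hits);
}
```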
New toggle (default false): when true and auth_type=JWT-HMAC, the k6 default function signs a fresh JWT per request with a brand-new sub instead of picking from the pre-built keys pool. Each request gets a distinct Tyk session and therefore a distinct DRL bucket, so bucket cardinality grows linearly with iteration count rather than plateauing at tests_auth_key_count. This is the cleanest signal for memory-leak regressions in the DRL bucket store like Tyk PR 8180: - with the bug: gateway memory and goroutine-related metrics climb forever; eventually the working-set panel hits the pod limit and Pod restarts steps up. - without it: the cleanup goroutine evicts expired buckets and memory plateaus despite the unbounded sub stream. Implementation: - New helper signRollingJWT() in tests.js (same secret/encode plumbing as generateJWTHMACKeys, but sub uses __VU + __ITER + Math.random for per-call uniqueness). - script.js default() picks a token via signRollingJWT() when rolling=true, otherwise falls through to the existing random-from-pool / route-modulo branches. - setup() still pre-builds the 10k pool either way so DRL bucket count starts non-zero from request 1. Cost: HMAC-SHA256 in goja is on the order of tens of microseconds per call. Negligible at 15k rps but non-zero - latency p99 panel will absorb it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
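The token-selection order described above might look roughly like this. It is a sketch, not the repository's script.js: `pickToken`, `rollingSub`, and the option names are hypothetical, the signer is injected rather than implemented, and in k6 the VU and iteration numbers would come from `__VU` and `__ITER`.

```javascript
// Builds a sub that is unique per call: VU number, iteration number,
// plus a random component.
function rollingSub(vu, iter, random = Math.random) {
  return `${vu}-${iter}-${random().toString(36).slice(2)}`;
}

// Selection order: rolling JWT first, then random-from-pool, then the
// original route-modulo mapping.
function pickToken({ rolling, randomSelection, keys, routeIndex, signRolling, random = Math.random }) {
  if (rolling) return signRolling();
  if (randomSelection) return keys[Math.floor(random() * keys.length)];
  return keys[routeIndex % keys.length];
}

const pool = ['k0', 'k1', 'k2', 'k3'];
console.log(pickToken({ rolling: false, randomSelection: false, keys: pool, routeIndex: 5 })); // k1
console.log(pickToken({ rolling: true, keys: pool, routeIndex: 5, signRolling: () => 'fresh-token' })); // fresh-token
```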
The PR 8180 repro is the whole reason this branch exists - leaving rolling JWTs off by default would mean every dispatch produces a plateau-then-flat memory curve that doesn't actually exercise the leak. Flip the default to true so the next workflow_dispatch without any tfvars override drives unbounded session cardinality and gives the dashboard panels something interesting to show. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Allocation rate panel stayed empty in the run on commit 2f9135c even though Go memory used and Go GC heap goal worked, which proves the OTel collector v0.78.0 era prometheusremotewrite exporter on this chart isn't appending the _total suffix to monotonic counters either - go.memory.allocated lands as plain go_memory_allocated. Make the suffix optional in the regex so the query matches whatever Prometheus actually has. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes:
- Precompute the JWT header outside the per-request hot path. encode()
was calling JSON.stringify({typ:"JWT",alg:"HS256"}) and b64encode on
the same constant input on every signRollingJWT() invocation. At
25k rps that is wasted work even if each call is cheap; lift it to
a JWT_HEADER_B64 module-level constant. The HMAC and the payload
b64encode are irreducibly per-request, but at least the header is
not.
- Add two leak-detector panels that ignore ramping load. The original
"Go memory used" panel rises whenever traffic rises (the
autoscaling-gradual scenario ramps over the run), so it cannot
distinguish "more load arrived" from "we are leaking memory". The
new panels normalise by request rate and allocation rate
respectively:
Heap bytes per RPS:
sum(go_memory_used) / sum(rate(tyk_api_requests_total[5m]))
Heap bytes per allocation:
sum(go_memory_used) / sum(rate(go_memory_allocations[5m]))
Without a leak both metrics are approximately flat (each request
allocates and GC reclaims; steady state is constant). With a
PR 8180-family leak they rise linearly because old buckets never
get evicted, so the gateway carries more retained bytes per unit
of in-flight work. Heap-bytes-per-RPS is the most useful panel
for spotting leaks when load is not constant.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
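To make the normalisation idea concrete, here is a small sketch using synthetic numbers (not measurements from any run). It divides heap bytes by request rate per sample and fits a least-squares slope; under the commit's reasoning a flat ratio suggests no leak and a rising ratio suggests retained memory. The function names and all sample values are assumptions for illustration.

```javascript
// Heap bytes per RPS for each sample.
function ratioSeries(heapBytes, rps) {
  return heapBytes.map((h, i) => h / rps[i]);
}

// Least-squares slope of values against their sample index.
function slope(values) {
  const n = values.length;
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  values.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return num / den;
}

// Synthetic ramping load.
const rps = [10000, 15000, 20000, 25000];
// Synthetic healthy heap: grows in proportion to load.
const healthyHeap = rps.map((r) => r * 20000);
// Synthetic leaky heap: the same, plus 50 MB retained per sample.
const leakyHeap = healthyHeap.map((h, i) => h + i * 50e6);

console.log(slope(ratioSeries(healthyHeap, rps))); // flat ratio
console.log(slope(ratioSeries(leakyHeap, rps)));   // rising ratio
```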
Add a use_jwt boolean input to the Full Performance Test workflow (default false). When unchecked, the deploy step gets auth_type=authToken and the tests step gets tests_auth_key_rolling=false - that's the repo's historical default behavior (API keys minted via the Tyk dashboard, picked from a pre-built pool). When checked, both flip to JWT-HMAC and rolling sub-per-request, which is the high-cardinality DRL-bucket repro path. Also flip the underlying tfvars defaults back so a tfvars-less "terraform apply" reproduces the historical baseline rather than the PR 8180 repro config. The workflow checkbox is the only way to get JWT mode now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ction
Three coupled config changes that together make the leak-detection panels readable as-is, without needing to reason about ramping artefacts or HPA pod churn:
- tests_executor: autoscaling-gradual -> constant-arrival-rate. Under constant load, any positive slope on heap-bytes-per-RPS or heap-bytes-per-allocation is unambiguously a leak (no denominator changes to mask it).
- hpa_enabled: true -> false. HPA was hiding the leak: when a pod's cache filled, the gateway slowed, so HPA scaled up a fresh pod with an empty cache and routed traffic there; on ramp-down, HPA terminated pods and freed the leaked memory. With a fixed pod set, each pod's individual heap climb is observable for the entire run.
- replica_count: 2 -> 6. Matches the steady-state pod count the previous HPA-driven runs settled into at ~20k rps, so capacity doesn't change but the autoscaler is out of the loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
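One possible way to turn "any positive slope is a leak" into a number is an ordinary least-squares fit over a pod's heap-bytes-per-RPS samples. This is only a sketch: the commit reads the panels by eye, and the function name, sample shape, and values below are assumptions made for illustration.
```javascript
// Least-squares slope of y over t for samples shaped like { t, y }.
// Under constant load, a clearly positive slope suggests retained memory.
function leastSquaresSlope(samples) {
  const n = samples.length;
  if (n < 2) throw new Error("need at least two samples");
  let sumT = 0, sumY = 0, sumTT = 0, sumTY = 0;
  for (const { t, y } of samples) {
    sumT += t;
    sumY += y;
    sumTT += t * t;
    sumTY += t * y;
  }
  const denominator = n * sumTT - sumT * sumT;
  if (denominator === 0) throw new Error("all samples share the same timestamp");
  return (n * sumTY - sumT * sumY) / denominator;
}

// Invented per-pod series: t in minutes, y in heap bytes per RPS.
const flatPod = [0, 1, 2, 3].map((t) => ({ t, y: 2048 }));
const climbingPod = [0, 1, 2, 3].map((t) => ({ t, y: 2048 + 30 * t }));

console.log(leastSquaresSlope(flatPod));     // no trend
console.log(leastSquaresSlope(climbingPod)); // positive trend
```
With real data the slope of a healthy pod would hover around zero rather than equal it, so any threshold would need tuning against a known-good run.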
cf377f6 bumped replica_count from 2 to 6 alongside the autoscaling -> constant-arrival-rate switch, on the theory that disabling HPA needed compensating fixed capacity. That was overreach for a default: 2 replicas is enough for low-rate functional runs, and users actually running sustained high-rate leak tests should make a conscious choice about pod count + tests_rate together rather than inheriting an opinionated triple-the-baseline value. Documentation in the variable description now explains the rough sizing rule for when to bump it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the open-ended >= 2.0.4 constraint with an exact pin to prevent Terraform from resolving to unavailable releases during terraform init.
Summary
This PR introduces comprehensive improvements to the performance testing infrastructure with three major enhancements:
🚀 POD Autoscaling (HPA) Enhancements
📦 ConfigMaps for API Definitions
📊 k6 Load Testing Improvements
Key Changes
Files Modified:
- deployments/main.tfvars.example, deployments/vars.performance.tf
- modules/deployments/tyk/api-definitions.tf (new), modules/deployments/tyk/operator.tf, modules/deployments/tyk/operator-api.tf, modules/deployments/tyk/main.tf
- modules/tests/test/main.tf
- deployments/main.tf, modules/deployments/main.tf, modules/deployments/vars.tf, modules/deployments/tyk/vars.tf
Technical Details:
- Custom scenarios when SCENARIO provided, scaling pattern as default
- Tyk operator only deployed when use_config_maps_for_apis=false
- API definitions at /opt/tyk-gateway/apps, policies at /opt/tyk-gateway/policies
Test Plan
- use_config_maps_for_apis=true
- use_config_maps_for_apis=false
🤖 Generated with Claude Code