Implement POD autoscaling and ConfigMaps for API definitions #26

Open
buger wants to merge 135 commits into main from feature/pod-autoscaling-and-configmaps

buger commented Aug 16, 2025

Summary

This PR introduces comprehensive improvements to the performance testing infrastructure with three major enhancements:

🚀 POD Autoscaling (HPA) Enhancements

  • Enable HPA by default with increased replica limits (2-12 replicas)
  • Better autoscaling configuration for performance testing scenarios
  • Enhanced load testing patterns that properly trigger scaling

📦 ConfigMaps for API Definitions

  • Replace Tyk Operator with ConfigMaps for API definition management
  • Conditional deployment logic: operator disabled when ConfigMaps enabled
  • File-based API and policy definitions mounted via Kubernetes ConfigMaps
  • Improved reliability and simpler deployment without operator dependency

📊 k6 Load Testing Improvements

  • Default gradual traffic scaling pattern (baseline → 2x scale-up → scale-down)
  • Backward compatibility with existing SCENARIO-based tests
  • Enhanced performance monitoring with response validation and thresholds
  • Autoscaling-friendly traffic patterns with proper timing for HPA response

Key Changes

Files Modified:

  • POD Autoscaling: deployments/main.tfvars.example, deployments/vars.performance.tf
  • ConfigMaps: modules/deployments/tyk/api-definitions.tf (new), modules/deployments/tyk/operator.tf, modules/deployments/tyk/operator-api.tf, modules/deployments/tyk/main.tf
  • Load Testing: modules/tests/test/main.tf
  • Variable Flow: deployments/main.tf, modules/deployments/main.tf, modules/deployments/vars.tf, modules/deployments/tyk/vars.tf

Technical Details:

  • Smart scenario selection: Custom scenarios when SCENARIO provided, scaling pattern as default
  • Conditional operator: Tyk operator only deployed when use_config_maps_for_apis=false
  • Volume mounts: API definitions at /opt/tyk-gateway/apps, policies at /opt/tyk-gateway/policies
  • Environment configuration: Proper Tyk gateway configuration for file-based operation
  • Complete variable flow: From root level to leaf modules with proper defaults

Test Plan

  • Verify HPA scaling with increased traffic
  • Test ConfigMaps mode: use_config_maps_for_apis=true
  • Test operator mode: use_config_maps_for_apis=false
  • Verify backward compatibility with existing SCENARIO tests
  • Test new gradual scaling pattern as default
  • Validate API definitions are properly mounted and accessible

🤖 Generated with Claude Code

buger and others added 30 commits August 16, 2025 07:57
This commit introduces comprehensive improvements to the performance testing infrastructure:

## POD Autoscaling (HPA) Enhancements
- Enable HPA by default with increased replica limits (2-12 replicas)
- Improved autoscaling configuration for better performance testing
- Enhanced load testing patterns that trigger scaling appropriately

## ConfigMaps for API Definitions
- Replace Tyk Operator with ConfigMaps for API definition management
- Conditional deployment logic: operator disabled when ConfigMaps enabled
- File-based API and policy definitions mounted via Kubernetes ConfigMaps
- Improved reliability and simpler deployment without operator dependency

## k6 Load Testing Improvements
- Default gradual traffic scaling pattern (baseline → 2x scale-up → scale-down)
- Backward compatibility with existing SCENARIO-based tests
- Enhanced performance monitoring with response validation and thresholds
- Autoscaling-friendly traffic patterns with proper timing for HPA response

## Key Features
- **Smart scenario selection**: Custom scenarios when SCENARIO provided, scaling pattern as default
- **Conditional operator**: Tyk operator only deployed when not using ConfigMaps
- **Volume mounts**: API definitions at /opt/tyk-gateway/apps, policies at /opt/tyk-gateway/policies
- **Environment configuration**: Proper Tyk gateway configuration for file-based operation
- **Variable flow**: Complete variable propagation from root to leaf modules

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add 'autoscaling-gradual' scenario to scenarios.js with 3-phase pattern
- Set new scenario as default executor instead of constant-arrival-rate
- Revert test script to original simple SCENARIO-based approach
- Maintain backward compatibility with all existing scenarios
- Update default test duration to 30 minutes for full scaling cycle

This maintains the original architecture while making gradual scaling
the default behavior through proper scenario selection.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copy workflow files for Terraform state management:
- terraform_reinit.yml: Reinitialize Terraform state
- terraform_unlock.yml: Unlock single Terraform state
- terraform_unlock_all.yml: Unlock all Terraform states
- clear_terraform_state.yml: Clear Terraform state (already present)

These workflows provide essential maintenance operations for
managing Terraform state in CI/CD environments.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Set use_config_maps_for_apis = true as default in all variable definitions
- Add explicit setting in deployments/main.tfvars.example
- Users can still opt for operator by setting use_config_maps_for_apis = false

This makes the more reliable ConfigMap approach the default while
maintaining backward compatibility with the operator-based approach.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add step to display first 200 lines of Tyk Gateway pod logs
- Helps diagnose startup issues and API mounting problems
- Runs after deployment but before tests start

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Change default tests_executor from constant-arrival-rate to autoscaling-gradual
- Update description to include the new scenario option
- Ensures tests properly exercise autoscaling behavior by default

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add step to show last 200 lines of Tyk Gateway logs after tests complete
- Helps diagnose any issues that occurred during load testing
- Complements the pre-test logs for full visibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Using indexed set blocks for extraEnvs created sparse arrays with
null entries, causing Kubernetes to reject deployments with "env[63].name:
Required value" error.

Solution (from BigBrain analysis):
- Moved all extraEnvs to locals as a single list
- Use yamlencode with values block instead of indexed set blocks
- Ensures every env entry has both name and value properties
- Eliminates sparse array issues that Helm creates with indexed writes

This follows Helm best practices for passing structured data and prevents
null placeholders in the final rendered container env list.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem: The autoscaling-gradual scenario was incorrectly structured as
an object with nested sub-scenarios (baseline_phase, scale_up_phase,
scale_down_phase), which k6 doesn't recognize as a valid scenario format.
As a result the tests never ran: the k6 custom resource (CR) was created but never executed.

Solution: Converted to a single ramping-arrival-rate scenario with all
stages combined sequentially:
- Baseline phase (0-5m): Ramp to and hold at 20k RPS
- Scale up phase (5m-20m): Gradually increase from 20k to 40k RPS
- Scale down phase (20m-30m): Gradually decrease back to 20k RPS

This follows the proper k6 scenario structure and ensures tests execute.

Confirmed via GitHub Actions logs: the test CR completed in 1s without running.
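
For reference, the three phases can be expressed as one `ramping-arrival-rate` scenario along these lines. This is a minimal sketch, not the contents of scenarios.js: the helper name, the VU pool sizes, and the 1-minute initial ramp are assumptions; only the executor, the phase boundaries, and the 20k/40k targets come from the commit message.

```javascript
// Hypothetical helper; the real scenarios.js may be structured differently.
function autoscalingGradual(baseRate) {
  return {
    executor: 'ramping-arrival-rate',
    startRate: 0,
    timeUnit: '1s',        // targets below are iterations per second
    preAllocatedVUs: 500,  // assumed pool size
    maxVUs: 5000,          // assumed ceiling
    stages: [
      { target: baseRate, duration: '1m' },      // ramp to baseline (assumed ramp length)
      { target: baseRate, duration: '4m' },      // hold baseline until 5m
      { target: baseRate * 2, duration: '15m' }, // scale up, 5m-20m
      { target: baseRate, duration: '10m' },     // scale down, 20m-30m
    ],
  };
}

// In a k6 script this object would be exported as `options`.
const options = { scenarios: { 'autoscaling-gradual': autoscalingGradual(20000) } };
```

The key point is that `stages` is a flat, sequential list inside a single scenario, rather than one nested object per phase.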

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem: API definitions were pointing to non-existent service
`upstream.upstream.svc.cluster.local:8080`, causing all requests
to fail with DNS lookup errors.

Solution: Updated target URL to match the actual deployed fortio services:
`fortio-${i % host_count}.tyk-upstream.svc:8080`

This matches the pattern used in the Operator version and ensures:
- APIs point to the correct fortio services in tyk-upstream namespace
- Load is distributed across multiple fortio instances using modulo
- Performance tests can actually reach the backend services
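
The modulo distribution can be illustrated with a short JavaScript sketch. The actual change lives in Terraform; the helper name and the `http://` scheme here are assumptions made for illustration.

```javascript
// Hypothetical helper mirroring the pattern fortio-${i % host_count}.tyk-upstream.svc:8080.
function upstreamTarget(apiIndex, hostCount) {
  // The http:// scheme is an assumption; the commit only gives host and port.
  return `http://fortio-${apiIndex % hostCount}.tyk-upstream.svc:8080`;
}

// Ten API definitions spread over three fortio instances.
const targets = Array.from({ length: 10 }, (_, i) => upstreamTarget(i, 3));
```

With three hosts, API indices 0, 3, 6 and 9 land on `fortio-0`, and the rest alternate between `fortio-1` and `fortio-2`.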

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes to support HPA autoscaling visibility:
1. Increase services_nodes_count to 2 - provides CPU headroom for HPA to work
   (single node at 100% CPU prevents HPA from functioning)
2. Set test duration default to 30 minutes to match autoscaling-gradual scenario
3. Keep replica_count at 2 with HPA min=2, max=12 for proper scaling

This configuration ensures:
- HPA has CPU capacity to scale pods up and down
- Test runs for full 30-minute autoscaling cycle
- Grafana will show HPA responding to load changes

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
With resources_limits_cpu=0, pod CPU percentages use undefined denominators,
making metrics confusing (4% pod vs 98% node). Setting explicit limits:
- CPU request: 1 vCPU, limit: 2 vCPUs per pod
- Memory request: 1Gi, limit: 2Gi per pod

This ensures:
- Pod CPU % = actual usage / 2 vCPUs (clear metric)
- HPA can make informed scaling decisions
- Node capacity planning is predictable

With c2-standard-4 nodes (4 vCPUs), each node can handle 2 pods at max CPU.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
The workflows were not passing services_nodes_count variable when creating
clusters, causing them to use the default value of 1 instead of the
configured value of 2 from main.tfvars.example.

This prevented HPA from working properly because a single node at 100% CPU
couldn't accommodate additional pods for scaling.

Fixed by explicitly passing --var="services_nodes_count=2" to terraform
apply for all cloud providers (GKE, AKS, EKS).

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Set CPU requests to 500m (was 0) to enable HPA percentage calculation
- Set memory requests to 512Mi (was 0) for proper resource allocation
- Set CPU limits to 2000m and memory limits to 2Gi
- Reduce HPA CPU threshold from 80% to 60% for better demo visibility

Without resource requests, HPA cannot calculate CPU utilization percentage,
causing pods to remain stuck at minimum replicas despite high node CPU usage.
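
The dependency on requests follows from how the HPA computes its target. The sketch below uses the core formula from the Kubernetes documentation, `ceil(currentReplicas * currentMetric / targetMetric)`; it deliberately ignores the controller's tolerance band and stabilization windows, and the function names are illustrative only.

```javascript
// CPU utilization is measured against the pod's CPU *request*.
// With no request there is no denominator, so no percentage can be computed.
function cpuUtilizationPct(usageMillicores, requestMillicores) {
  if (!requestMillicores) return null;
  return (usageMillicores * 100) / requestMillicores;
}

// Simplified HPA decision, clamped to the configured replica bounds.
function desiredReplicas(current, utilPct, targetPct, min, max) {
  if (utilPct === null) return current; // metric unavailable: no scaling decision
  const raw = Math.ceil(current * (utilPct / targetPct));
  return Math.min(max, Math.max(min, raw));
}

// Two pods using 450m each against a 500m request, with a 60% target.
const util = cpuUtilizationPct(450, 500);
const next = desiredReplicas(2, util, 60, 2, 12);
```

With a request of 0 the first function returns `null` and the replica count never moves, which matches the "stuck at minimum replicas" symptom described above.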

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Adjust HPA threshold to 70% (balanced between 60% and 80%)
- Reduce base load from 20k to 15k req/s for more realistic testing
- Scale load pattern from 15k → 35k req/s (was 20k → 40k)
- Increase API routes from 1 to 10 (still using 1 policy/app)
- Update autoscaling-gradual scenario with fixed 35k peak target

Load pattern now:
- Baseline: 15k req/s
- Peak: 35k req/s (fixed value to ensure exact target)
- Gradual scaling through 20k, 25k, 30k steps

This provides more realistic load levels and clearer HPA scaling demonstration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Maintains flexibility - if rate changes, the peak will scale proportionally.
With rate=15000, this gives us exactly 34,950 ≈ 35k req/s at peak.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Keep it simple - rate * 2.33 works fine without rounding.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
With 3 nodes and HPA scaling from 2-12 pods, we can better demonstrate:
- Initial distribution across 3 nodes
- Pod scaling as load increases
- More realistic production-like setup

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Updated services_nodes_count from varying values to 3 in:
- gke/main.tfvars.example (was 2)
- aks/main.tfvars.example (was 1)
- eks/main.tfvars.example (was 1)

This ensures consistency with the GitHub Actions workflow and provides
better load distribution across nodes for HPA scaling demonstrations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Added workflow inputs for optional node failure simulation
- simulate_node_failure: boolean to enable/disable feature
- node_failure_delay_minutes: configurable delay before termination
- Implements cloud-specific node termination (Azure/AWS/GCP)
- Runs as background process during test execution
- Provides visibility into node termination and cluster recovery

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Updated GitHub Actions workflow to use 4 nodes
- Updated all example configurations (GKE, AKS, EKS)
- Provides better capacity for node failure simulation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Added test_duration_minutes workflow input (default 30, max 360)
- Made autoscaling-gradual scenario duration-aware with proportional phases
- Adjusted deployment stabilization wait time (5-15 min based on duration)
- Scaled K6 setup timeout with test duration (10% of duration, min 300s)
- Supports tests from 30 minutes to 6 hours

Key changes:
- Baseline phase: ~17% of total duration
- Scale-up phase: ~50% of total duration
- Scale-down phase: ~33% of total duration
- Maintains same load profile (15k->35k->15k) regardless of duration
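
One way to derive the proportional phases is sketched below. The helper name and the rounding choice are assumptions; the fractions (roughly 1/6, 1/2, 1/3) are taken from the percentages listed above.

```javascript
// Hypothetical helper: split a total test duration into the three phases.
function phaseMinutes(totalMinutes) {
  const baseline = Math.round(totalMinutes / 6); // ~17%
  const scaleUp = Math.round(totalMinutes / 2);  // ~50%
  // The remainder goes to scale-down so the phases always sum to the total.
  const scaleDown = totalMinutes - baseline - scaleUp; // ~33%
  return { baseline, scaleUp, scaleDown };
}

const defaultRun = phaseMinutes(30);  // the 30-minute default
const longRun = phaseMinutes(360);    // the 6-hour maximum
```

Assigning the remainder to the last phase avoids rounding drift for durations that are not multiples of six.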

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
The node failure simulation was running but couldn't find gateway pods
due to incorrect label selector. Fixed to use the correct selector:
--selector=app=gateway-tyk-tyk-gateway

This matches what's used in the 'Show Tyk Gateway logs' steps.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
The snapshot job was timing out because the timeout calculation was incorrect.
For a 30-minute test:
- Job waits 40 minutes (duration + buffer) before starting snapshot
- Previous timeout: (30 + 10) * 2 = 80 minutes total
- Job would time out before completing snapshot generation

Fixed to: duration + buffer + 20 minutes extra for snapshot generation
New timeout for 30-min test: 30 + 10 + 20 = 60 minutes
This gives enough time for the delay plus actual snapshot work.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added a new Node Count panel next to the Gateway HPA panel to track:
- Number of nodes per gateway type (Tyk, Kong, Gravitee, Traefik)
- Total cluster nodes
- Will show node failures clearly (e.g., drop from 4 to 3 nodes)

This complements the HPA panel which shows pod count. While pods get
rescheduled quickly after node failure, the node count will show the
actual infrastructure reduction.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added a new 'Pod Disruption Events' panel that tracks:
- Pending pods (yellow) - pods waiting to be scheduled
- ContainerCreating (orange) - pods being initialized
- Terminating (red) - pods being shut down
- Failed pods (dark red) - pods that failed to start
- Restarts (purple bars) - container restart events

This panel will clearly show disruption when a node fails:
- Spike in Terminating pods when node is killed
- Spike in Pending/ContainerCreating as pods reschedule
- Possible restarts if pods crash during migration

Reorganized Horizontal Scaling section layout:
- Pod Disruption Events (left) - shows scheduling disruptions
- Gateway HPA (middle) - shows pod counts
- Node Count (right) - shows infrastructure changes

Now you'll visually see the chaos when node failure occurs!

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed several issues with the metrics queries:

1. Node Count panel:
   - Added fallback query using kube_node_status_condition for better node tracking
   - Should now properly show node count changes (4 -> 3 when node fails)

2. Pod Disruption Events panel:
   - Removed 'OR on() vector(0)' which was causing all metrics to show total pod count
   - These queries will now only show actual disrupted pods (not all pods)
   - Added 'New Pods Created' metric to track pod rescheduling events

The issue was that 'OR on() vector(0)' returns 0 when there's no data, but when
combined with count(), it was returning the total count instead. Now the queries
will properly show 0 when there are no pods in those states, and actual counts
when disruption occurs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Based on architect agent analysis, fixed critical issues:

1. Node Count Panel - Fixed regex pattern:
   - Was: .*tyk-np.* (didn't match GKE node names)
   - Now: .*-tyk-np-.* (matches gke-pt-us-east1-c-tyk-np-xxxxx)
   - Removed OR condition, using only kube_node_status_condition for accuracy
   - Applied same fix to all node pools (kong, gravitee, traefik)

2. Pod Disruption Events - Enhanced queries:
   - Terminating: Added > 0 filter to count only pods with deletion timestamp
   - New Pods: Changed from increase to rate * 120 for better visibility
   - Added Evicted metric to track pod evictions during node failure

These fixes address why node count wasn't changing from 4→3 during node
termination. The regex pattern was the key issue - it didn't match the
actual GKE node naming convention.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Disable auto-repair for node pool before deletion
- Use gcloud compute instances delete with --delete-disks=all flag
- Run deletion in background for more abrupt failure
- Add monitoring to track pod disruption impact
- Show pod count on node before termination

This creates a more realistic sudden node failure by preventing
automatic recovery and ensuring complete VM deletion.

- Remove invalid --delete-disks=all flag
- Force delete instance and wait for completion
- Resize node pool down then up to control recovery timing
- Better monitoring of node count and pod disruption
- This creates true hard shutdown behavior with maximum impact
tbuchaillot and others added 30 commits March 17, 2026 16:12
This reverts commit d943aae.
Bumps tests_auth_key_count default from 100 to 10000 and introduces
tests_auth_key_random_selection (default false) which decouples the
Authorization token from the route index in the k6 script. Set it to
true together with rate_limit_enabled=true to drive high-cardinality
DRL bucket usage when reproducing the memorycache leak fixed in Tyk
PR 8180.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip auth_enabled, rate_limit_enabled, and tests_auth_key_random_selection
defaults from false to true so the existing Full Performance Test workflow
(which doesn't expose tfvars inputs for these) drives the high-cardinality
DRL bucket pattern needed to repro the Tyk PR 8180 memorycache leak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes that together restore observability when a k6 segment crashes
during setup:

- Switch K6 CR cleanup from "post" to "pre". With "post" the operator
  deletes runner/initializer pods (and their logs) the moment the test
  ends, success or failure - so any setup() crash leaves no evidence
  to debug. "pre" still tidies up before the next test, but keeps this
  test's pods around long enough for the workflow's log-capture step
  to actually find something.

- Tighten wait_for_k6_segment's "CR disappeared after being in
  'started' -> success" heuristic. Run 25455997972 went from CR
  creation to disappearance in 32 seconds on a 60-minute test budget
  and the script reported success; the workflow turned green with
  zero load generated. Now require at least 70% of the segment
  budget to have elapsed before treating disappearance as
  completion, and on a too-fast disappearance dump runner /
  initializer / operator logs and fail.
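
The tightened rule amounts to a single comparison. The sketch below is in JavaScript for illustration only (the workflow's wait function is presumably a shell script); the names are hypothetical and the 70% threshold is the one stated above.

```javascript
const MIN_ELAPSED_PCT = 70; // from the commit message

// Decide how to treat a K6 CR that disappeared after reaching "started".
function classifyDisappearance(elapsedSeconds, budgetSeconds) {
  // Integer arithmetic avoids floating-point edge cases at the threshold.
  const plausible = elapsedSeconds * 100 >= budgetSeconds * MIN_ELAPSED_PCT;
  return plausible ? 'completed' : 'too-fast';
}

// The failing run: 32 seconds elapsed on a 60-minute budget.
const verdict = classifyDisappearance(32, 60 * 60);
```

A `too-fast` verdict is where the workflow would dump runner, initializer, and operator logs and fail the step.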

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous implementations of generateKeys / createKeys /
createApplications / createSubscriptions in the Tyk, Kong, and
Gravitee modules issued requests sequentially against the gateway
admin APIs and called fail() on the first non-2xx response. With the
new tests_auth_key_count default of 10000 that meant a single
transient timeout or 5xx during k6 setup() killed the entire test
run before any load was generated (run 25455997972 - K6 CR went
from creation to operator cleanup in 32 seconds).

Three changes per gateway:

- Parallelism via http.batch in groups of 50, since k6 setup() is
  single-VU and that is the only way to drive concurrency. Cuts 10k
  sequential POSTs down to ~200 round trips.

- Per-request retry loop with exponential backoff (4 attempts:
  initial + 3 retries at 100ms / 200ms / 400ms). A transient
  flake on one key no longer aborts the run.

- Soft failure tolerance: up to TOLERANCE_PCT (1%) of keys may fail
  after retries; the run continues with whatever keys did succeed.
  Above the threshold we still call fail() loudly. Progress is
  logged every 1000 keys so the initializer/runner pod log shows
  forward motion in real time.
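
The arithmetic behind these three changes can be sketched with a few pure helpers. The names are illustrative and not the functions in the gateway modules; the batch size, retry count, base delay, and 1% tolerance are the values stated above.

```javascript
const BATCH_SIZE = 50;
const TOLERANCE_PCT = 1;

// Split the work into groups that can each be sent as one http.batch call.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Exponential backoff: the delay doubles before each successive retry.
function retryDelaysMs(retries, baseMs) {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

// Fail the run only when more than TOLERANCE_PCT of keys failed after retries.
function exceedsTolerance(failed, total) {
  return failed * 100 > total * TOLERANCE_PCT;
}

const keyIds = Array.from({ length: 10000 }, (_, i) => i);
const batches = chunk(keyIds, BATCH_SIZE);
const delays = retryDelaysMs(3, 100);
```

For 10,000 keys this yields 200 batches, and up to 100 failed keys are tolerated before the run aborts.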

Also stop the wait_for_k6_segment finished-branch from blocking on a
10-minute kubectl wait --for=delete: with cleanup: pre the operator
keeps the CR (and its pods) around on purpose, so the wait was just
dead time at the end of every successful segment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch the default auth_type from authToken to JWT-HMAC and wire the
Tyk configmap-path API definitions to honour it. The k6 setup()
function now signs JWTs locally (HMAC-SHA256) instead of POSTing
10,000 keys to the Tyk dashboard - which means setup completes in
seconds, can't be killed by a single transient dashboard 5xx, and
needs no setupTimeout headroom.

Three pieces:

- Make generateJWTHMACKeys produce a unique sub per key (drop the
  "% 100" cycle). Tyk's JWT middleware uses jwt_identity_base_field
  to derive the session identity, so 10k unique subs map to 10k
  distinct Tyk sessions and therefore 10k distinct DRL rate-limit
  buckets - which is the high-cardinality scenario PR 8180 is about.

- Mirror the JWT auth wiring from operator-api.tf into
  api-definitions.tf, so use_config_maps_for_apis=true (the default)
  also gets enable_jwt / jwt_signing_method / jwt_source /
  jwt_identity_base_field / jwt_policy_field_name /
  jwt_default_policies. JWT defaults are added via merge() only when
  auth.enabled and auth.type is JWT-HMAC or JWT-RSA, so the
  authToken path is untouched.

- Add explicit _id and id fields to the policy JSON so file-based
  policy loading produces a deterministic policy ID for
  jwt_default_policies to reference.

The batched authToken generators (Tyk/Kong/Gravitee) stay as-is -
they're still useful for anyone running with auth_type=authToken,
they're just no longer the default repro path.
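
Local HS256 signing with a unique `sub` per key looks roughly like this. The sketch uses Node's built-in `crypto` module so it runs standalone; the k6 script would use k6's own crypto and encoding modules instead. The function names, the `user-<n>` sub format, and the secret are assumptions.

```javascript
const nodeCrypto = require('node:crypto');

// base64url without padding, as JWTs require.
function b64url(input) {
  return Buffer.from(input).toString('base64url');
}

// Sign a payload as a compact JWT using HMAC-SHA256 (HS256).
function signHS256(payload, secret) {
  const header = b64url(JSON.stringify({ typ: 'JWT', alg: 'HS256' }));
  const signingInput = `${header}.${b64url(JSON.stringify(payload))}`;
  const signature = nodeCrypto
    .createHmac('sha256', secret)
    .update(signingInput)
    .digest('base64url');
  return `${signingInput}.${signature}`;
}

// One token per key, each with a distinct sub (no modulo cycling),
// so each token maps to its own gateway session.
function generateUniqueKeys(count, secret) {
  const keys = [];
  for (let i = 0; i < count; i++) keys.push(signHS256({ sub: `user-${i}` }, secret));
  return keys;
}

const keys = generateUniqueKeys(1000, 'example-secret'); // placeholder secret
```

Because no network call is involved, generating the pool is bounded by CPU time only, which is why setup no longer depends on the dashboard being healthy.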

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the OTel collector's metrics pipeline only logged received
metrics ("logging" exporter), so even though Tyk Gateway already had
TYK_GW_OPENTELEMETRY_METRICS_ENABLED wiring, nothing was queryable in
Grafana. Memory-leak regressions like Tyk PR 8180 are invisible to
the test infrastructure as a result.

Two changes:

- otel-collector.tf: add a prometheusremotewrite exporter pointing at
  the same Prometheus k6 already writes to
  (prometheus-server.dependencies.svc:80/api/v1/write), and switch
  the metrics pipeline from logging -> prometheusremotewrite. Tyk
  gateway runtime metrics (heap_inuse, heap_objects, goroutines,
  gc_duration, etc.) now land next to k6_http_reqs_total in the same
  Grafana datasource.

- vars.middleware.tf + main.tfvars.example: flip
  open_telemetry_metrics_enabled and open_telemetry_runtime_metrics
  defaults from false to true. The OTel collector deploys whenever
  either traces or metrics is enabled, so this also brings the
  collector up. Traces remain off by default - jaeger is heavy and
  not needed for leak detection.

Follow-up commit will add a Grafana panel/row for memory & GC stats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs in the recent observability + parallelism work surfaced
during a follow-up review:

- cleanup: "pre" is invalid. The k6-operator TestRun CRD declares
  +kubebuilder:validation:Enum=post for the Cleanup field (see
  api/v1alpha1/testrun_types.go in grafana/k6-operator), so the
  API server would reject any K6 CR with cleanup: "pre" outright,
  and our terraform apply would have failed before any test ran.
  Omit the field instead - the operator then does no cleanup,
  which is what we wanted: pods persist for post-mortem logs and
  terraform destroy tidies them up between runs.

- batch:20 / batchPerHost:6 are k6's default ceilings on http.batch.
  Even though our auth.js generators issue 50-wide batches, the
  effective concurrency was capped at 6 against any single host
  (the gateway admin API). Lift the option-level ceilings to 50
  so the parallelism we asked for in d44bf4f is what actually
  flies. JWT-HMAC default path doesn't need this, but it's still
  correct for anyone running with auth_type=authToken.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Gateway Memory & GC (Tyk OTel runtime metrics)" row to the
existing k6 dashboard with six panels designed to make memory leaks
of the Tyk PR 8180 family visible at a glance:

- Go memory used (heap + stack), split by go_memory_type. The "other"
  bucket is the heap; on a leak it climbs steadily under flat traffic.
- Go GC heap goal. Rising goal alongside rising memory.used confirms
  the heap is genuinely growing (not allocator slack).
- Goroutines (go_goroutine_count). Monotonic climb under steady
  traffic is the smoking gun for goroutine / timer leaks.
- Allocation rate (rate(go_memory_allocated_bytes_total[5m])). Flat
  allocation rate while memory.used climbs is the leak fingerprint.
- Container working set (cAdvisor container_memory_working_set_bytes).
  The kernel's view - matches what gets OOM-killed.
- Pod restart count. Steps up = OOM kill. Climb-then-cliff with the
  working-set panel is the unmistakable end-stage of a leak run.

Metric names verified against the actual Tyk OTel instrumentation in
../tyk:
- runtime contrib v0.67.0 emits the new go.* names (go.memory.used,
  go.goroutine.count, go.memory.allocated, go.memory.gc.goal). The
  legacy process.runtime.go.* / runtime.go.* set is gated behind
  OTEL_GO_X_DEPRECATED_RUNTIME_METRICS and off by default, so we
  cannot use heap_inuse / heap_objects / gc.pause_ns.
- After prometheusremotewrite normalization (dots->underscores, units
  appended, monotonic counters get _total) the Prometheus names are
  go_memory_used_bytes, go_memory_gc_goal_bytes, go_goroutine_count,
  go_memory_allocated_bytes_total. These are what the panels query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
terraform plan failed in run 25503541943 with:
  Error: Inconsistent conditional result types
  on api-definitions.tf line 64
  The 'true' value includes object attribute "enable_jwt", which is
  absent in the 'false' value.

Terraform's strict typechecker rejects ?: arms whose object attribute
sets differ - so "enable_jwt ? { jwt_signing_method = ... } : {}"
fails because the empty object on the false branch has no attributes.
merge() inherits the same constraint when its arguments are themselves
typed objects.

The standard fix is to keep the schema static and conditionalize the
values: enable_jwt is always present (either true or false), and the
jwt_* fields are always present too with empty defaults when JWT is
off. Tyk ignores jwt_signing_method / jwt_source / jwt_default_policies
when enable_jwt is false, so this is functionally equivalent to the
omitted-field form.

Verified locally with terraform init -backend=false + terraform
validate in both deployments/ and tests/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment-only follow-up to 24d9ca7. The wait function comments still
referenced "cleanup: post" and "cleanup: pre" but the K6 CR
manifest no longer sets cleanup at all (those values either don't
exist - "pre" - or destroy log evidence - "post"). Updated the
comments so a future reader doesn't get misled into thinking the
script is reasoning about an operator-driven CR deletion that never
happens in normal operation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first run with metrics enabled (run on commit 6e71926) showed
"Goroutines" populated but "Go memory used", "Go GC heap goal", and
"Allocation rate" all empty. Confirmed cause: the OTel collector
helm chart 0.62.0 ships an older collector (~v0.78.0) where the
prometheusremotewrite exporter does NOT add the unit suffix to
metric names by default - that became opt-in via add_metric_suffixes
later. The OTel instrument go.memory.used (unit "By") therefore
lands in Prometheus as go_memory_used, not go_memory_used_bytes.
go_goroutine_count works because its UCUM unit "{goroutine}" is
dropped, leaving no suffix to mismatch.

Switch the three failing panels to regex matches on __name__ so
they work with both naming variants (current collector and any
future bump that turns suffixes back on):

  go_memory_used_bytes        -> {__name__=~"go_memory_used(_bytes)?"}
  go_memory_gc_goal_bytes     -> {__name__=~"go_memory_gc_goal(_bytes)?"}
  go_memory_allocated_bytes_total -> {__name__=~"go_memory_allocated(_bytes)?_total"}

Counters keep _total because that's a separate exporter convention
and is added regardless of add_metric_suffixes.
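
The effect of the optional suffix group can be checked outside Prometheus. This is a small sketch: PromQL's `=~` is fully anchored, which the helper imitates by wrapping the pattern in `^(?:...)$`; the helper name is hypothetical, and PromQL uses RE2 syntax, though these simple patterns behave the same in JavaScript.

```javascript
// Mimic PromQL's anchored regex matching on a metric name.
function promNameMatches(pattern, metricName) {
  return new RegExp(`^(?:${pattern})$`).test(metricName);
}

const usedPattern = 'go_memory_used(_bytes)?';
const allocatedPattern = 'go_memory_allocated(_bytes)?_total';

// Both naming variants are accepted for the gauge.
const matchesOldCollector = promNameMatches(usedPattern, 'go_memory_used');
const matchesNewCollector = promNameMatches(usedPattern, 'go_memory_used_bytes');
```

Note that `allocatedPattern` still requires `_total`, so it would not match a counter exported without that suffix.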

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New toggle (default false): when true and auth_type=JWT-HMAC, the k6
default function signs a fresh JWT per request with a brand-new sub
instead of picking from the pre-built keys pool. Each request gets a
distinct Tyk session and therefore a distinct DRL bucket, so bucket
cardinality grows linearly with iteration count rather than plateauing
at tests_auth_key_count.

This is the cleanest signal for memory-leak regressions in the DRL
bucket store like Tyk PR 8180:
- with the bug: gateway memory and goroutine-related metrics climb
  forever; eventually the working-set panel hits the pod limit and
  Pod restarts steps up.
- without it: the cleanup goroutine evicts expired buckets and memory
  plateaus despite the unbounded sub stream.

Implementation:
- New helper signRollingJWT() in tests.js (same secret/encode plumbing
  as generateJWTHMACKeys, but sub uses __VU + __ITER + Math.random
  for per-call uniqueness).
- script.js default() picks a token via signRollingJWT() when
  rolling=true, otherwise falls through to the existing
  random-from-pool / route-modulo branches.
- setup() still pre-builds the 10k pool either way so DRL bucket
  count starts non-zero from request 1.
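
The token selection order described above can be sketched as follows. This is an illustration under assumptions, not the code in script.js: the function and field names are hypothetical, the signer is injected as a parameter, and in k6 the `vu` and `iter` values would come from `__VU` and `__ITER`.

```javascript
// Build a sub that is unique per call: VU id, iteration number, random nonce.
function rollingSub(vu, iter) {
  const nonce = Math.random().toString(36).slice(2);
  return `rolling-${vu}-${iter}-${nonce}`;
}

// Pick the Authorization token for one request.
// `sign` is any function that turns a JWT payload into a signed token.
function pickToken(cfg, keys, vu, iter, routeIndex, sign) {
  if (cfg.rolling) return sign({ sub: rollingSub(vu, iter) });                    // fresh JWT per request
  if (cfg.randomSelection) return keys[Math.floor(Math.random() * keys.length)]; // random from pool
  return keys[routeIndex % keys.length];                                         // route-modulo fallback
}
```

Only the rolling branch creates a new identity per request; the other two branches are bounded by the size of the pre-built pool.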

Cost: HMAC-SHA256 in goja is on the order of tens of microseconds per
call. Negligible at 15k rps but non-zero - latency p99 panel will
absorb it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PR 8180 repro is the whole reason this branch exists - leaving
rolling JWTs off by default would mean every dispatch produces a
memory curve that rises to a plateau and stays flat, which doesn't
actually exercise the leak. Flip the default to true so the next
workflow_dispatch without any tfvars override drives unbounded
session cardinality and gives the dashboard panels something
interesting to show.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Allocation rate panel stayed empty in the run on commit 2f9135c
even though Go memory used and Go GC heap goal worked, which indicates
that the OTel collector v0.78.0-era prometheusremotewrite exporter on
this chart isn't appending the _total suffix to monotonic counters
either: go.memory.allocated lands as plain go_memory_allocated.
Make the suffix optional in the regex so the query matches whatever
Prometheus actually has.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes:

- Precompute the JWT header outside the per-request hot path. encode()
  was calling JSON.stringify({typ:"JWT",alg:"HS256"}) and b64encode on
  the same constant input on every signRollingJWT() invocation. At
  25k rps that is wasted work even if each call is cheap; lift it to
  a JWT_HEADER_B64 module-level constant. The HMAC and the payload
  b64encode are irreducibly per-request, but at least the header is
  not.

- Add two leak-detector panels that ignore ramping load. The original
  "Go memory used" panel rises whenever traffic rises (the
  autoscaling-gradual scenario ramps over the run), so it cannot
  distinguish "more load arrived" from "we are leaking memory". The
  new panels normalise by request rate and allocation rate
  respectively:

    Heap bytes per RPS:
      sum(go_memory_used) / sum(rate(tyk_api_requests_total[5m]))

    Heap bytes per allocation:
      sum(go_memory_used) / sum(rate(go_memory_allocations[5m]))

  Without a leak both metrics are approximately flat (each request
  allocates and GC reclaims; steady state is constant). With a
  PR 8180-family leak they rise linearly because old buckets never
  get evicted, so the gateway carries more retained bytes per unit
  of in-flight work. Heap-bytes-per-RPS is the most useful panel
  for spotting leaks when load is not constant.
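
To see why normalising by request rate separates "more load" from "leaking", here is a small synthetic model. It is not a measurement. `simulate`, its parameters, and all numbers are assumptions chosen for illustration: live heap is taken to be proportional to the current request rate, and a leak adds a fixed number of never-reclaimed bytes per request served.

```javascript
// Synthetic model of one gateway under a ramping load.
// liveBytesPerRps: heap held per unit of request rate in steady state.
// leakBytesPerRequest: bytes retained forever per request (0 = healthy).
function simulate(rpsSeries, liveBytesPerRps, leakBytesPerRequest, stepSeconds = 60) {
  let leaked = 0;
  return rpsSeries.map((rps) => {
    leaked += rps * stepSeconds * leakBytesPerRequest;
    const heap = rps * liveBytesPerRps + leaked;
    return { rps, heap, heapPerRps: heap / rps };
  });
}

const ramp = [5000, 10000, 15000, 20000, 25000]; // load ramps over the run
const healthy = simulate(ramp, 4096, 0);
const leaking = simulate(ramp, 4096, 64);

for (let i = 0; i < ramp.length; i++) {
  console.log(
    `rps=${ramp[i]} healthy heap/rps=${healthy[i].heapPerRps} leaking heap/rps=${leaking[i].heapPerRps}`
  );
}
```

In this model raw heap rises in both runs because the load ramps, but heap per RPS stays constant for the healthy run and climbs at every step for the leaking one.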

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a use_jwt boolean input to the Full Performance Test workflow
(default false). When unchecked, the deploy step gets
auth_type=authToken and the tests step gets
tests_auth_key_rolling=false - that's the repo's historical default
behavior (API keys minted via the Tyk dashboard, picked from a
pre-built pool). When checked, both flip to JWT-HMAC and rolling
sub-per-request, which is the high-cardinality DRL-bucket repro path.

Also flip the underlying tfvars defaults back so a tfvars-less
"terraform apply" reproduces the historical baseline rather than the
PR 8180 repro config. The workflow checkbox is the only way to get
JWT mode now.
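
The checkbox-to-variables mapping can be sketched as a small function. `resolveAuthSettings` and `toTerraformArgs` are hypothetical helpers. How the workflow actually passes the values to Terraform is not shown in this commit, so the `-var` rendering is an assumption.

```javascript
// Map the use_jwt workflow input to the two variables named above.
function resolveAuthSettings(useJwt) {
  return useJwt
    ? { auth_type: "JWT-HMAC", tests_auth_key_rolling: true }
    : { auth_type: "authToken", tests_auth_key_rolling: false };
}

// Render settings as Terraform CLI arguments (assumed delivery mechanism).
function toTerraformArgs(settings) {
  return Object.entries(settings).flatMap(([name, value]) => [
    "-var",
    `${name}=${value}`,
  ]);
}

console.log(toTerraformArgs(resolveAuthSettings(false)).join(" "));
console.log(toTerraformArgs(resolveAuthSettings(true)).join(" "));
```

Keeping both variables behind a single boolean avoids the mixed state where rolling is on but auth_type is still authToken, in which case the rolling toggle would have no effect.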

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ction

Three coupled config changes that together make the leak-detection
panels readable as-is, without needing to reason about ramping
artefacts or HPA pod churn:

- tests_executor: autoscaling-gradual -> constant-arrival-rate.
  Under constant load, any positive slope on heap-bytes-per-RPS or
  heap-bytes-per-allocation is unambiguously a leak (no denominator
  changes to mask it).
- hpa_enabled: true -> false. HPA was hiding the leak: when a pod's
  cache filled and the gateway slowed, HPA scaled up a fresh pod with
  an empty cache and routed traffic there; on ramp-down HPA terminated
  pods, freeing the leaked memory. With a fixed pod set, each pod's
  individual heap climb is observable for the entire run.
- replica_count: 2 -> 6. Matches the steady-state pod count the
  previous HPA-driven runs settled into at ~20k rps, so capacity
  doesn't change but the autoscaler is out of the loop.
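
With the load held constant, reading the panels reduces to checking a slope. The sketch below fits an ordinary least-squares line to a per-pod heap-bytes-per-RPS series. The `slope` helper and both sample series are synthetic assumptions, and any threshold for calling a slope "positive" would need tuning against real run-to-run noise.

```javascript
// Ordinary least-squares slope of ys over xs.
function slope(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - meanX) * (ys[i] - meanY);
    den += (xs[i] - meanX) ** 2;
  }
  return num / den;
}

// Synthetic heap-bytes-per-RPS samples for one pod, taken every 5 minutes.
const minutes = [0, 5, 10, 15, 20, 25, 30];
const flat = [4100, 4090, 4105, 4095, 4102, 4098, 4100]; // noise only
const climbing = minutes.map((m) => 4100 + 40 * m); // steady growth

console.log("flat slope (bytes/RPS per minute):", slope(minutes, flat));
console.log("climbing slope (bytes/RPS per minute):", slope(minutes, climbing));
```

Because the pod set is fixed, the same check can be applied to each pod's series separately rather than to a sum that changes as pods come and go.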

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cf377f6 bumped replica_count from 2 to 6 alongside the autoscaling
-> constant-arrival-rate switch, on the theory that disabling HPA
needed compensating fixed capacity. That was overreach for a default:
2 replicas is enough for low-rate functional runs, and users actually
running sustained high-rate leak tests should make a conscious choice
about pod count and tests_rate together rather than inheriting an
opinionated triple-the-baseline value. The variable description now
documents the rough sizing rule for when to bump it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the open-ended >= 2.0.4 constraint with an exact pin to prevent
Terraform from resolving to unavailable releases during terraform init.