[Proposal] Add Prometheus Metrics for Workload Manager and Router Observability

## Background

The v0.1.0 release blog ([docs/agentcube/blog/release-v0.1.0/index.md](https://github.com/volcano-sh/agentcube/blob/main/docs/agentcube/blog/release-v0.1.0/index.md)) lists under "Features and Enhancements":

> **Prometheus metrics**: metrics exported by Router and Workload Manager for operational observability

The PicoD design proposal ([docs/design/picod-proposal.md](https://github.com/volcano-sh/agentcube/blob/main/docs/design/picod-proposal.md)) also mentions:

> Prometheus-compatible endpoint for monitoring and observability.

The agent-scheduler Helm chart also already configures Prometheus scrape annotations.

However, no Prometheus metrics are actually implemented in the Router or Workload Manager Go code. Both servers rely exclusively on `klog` statements for operational visibility, meaning there is no way to track sandbox creation latency, failure rates, GC behavior, or proxy performance through standard monitoring tooling.

## Proposal

Implement Prometheus `/metrics` endpoints for both the Workload Manager and the Router, fulfilling the feature listed in the v0.1.0 release notes. The implementation would use `prometheus/client_golang` (already a transitive dependency via `client-go`).

The metrics should cover the three operational surfaces that matter most for AgentCube operators:

### 1. Sandbox Lifecycle (Workload Manager)

These metrics capture the full create/delete lifecycle that the Workload Manager orchestrates:

- `agentcube_sandbox_create_duration_seconds` - histogram, labeled by `kind` (AgentRuntime / CodeInterpreter) and `status` (success / failure), covering the creation path end to end (store placeholder, K8s resource creation, readiness wait, entrypoint probe)
- `agentcube_sandbox_create_total` - counter by `kind` and `status`
- `agentcube_sandbox_delete_total` - counter by `kind`
- `agentcube_sandbox_rollback_total` - counter, tracks how often creation rollbacks are triggered

### 2. Garbage Collection (Workload Manager)

These metrics give visibility into the GC loop that reclaims idle and expired sandboxes:

- `agentcube_gc_cycle_duration_seconds` - histogram for each `once()` cycle
- `agentcube_gc_sandboxes_reclaimed_total` - counter by `reason` (expired / inactive)
- `agentcube_gc_errors_total` - counter for GC-cycle failures
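
One possible way to wrap a GC cycle so that all three metrics above are recorded in one place is sketched below. To stay dependency-free it goes through a small `gcRecorder` interface; that interface, the `gcResult` type, and the assumption that a cycle can report per-reason counts are all hypothetical, since the real signature of `once()` is not shown here. A Prometheus-backed recorder would forward these calls to a histogram and two counters.

```go
package main

import "time"

// gcRecorder is a hypothetical seam between the GC loop and the metrics backend.
type gcRecorder interface {
	ObserveCycle(seconds float64)      // agentcube_gc_cycle_duration_seconds
	AddReclaimed(reason string, n int) // agentcube_gc_sandboxes_reclaimed_total
	IncError()                         // agentcube_gc_errors_total
}

// gcResult is an assumed summary of what one cycle reclaimed.
type gcResult struct {
	Expired  int
	Inactive int
}

// instrumentedOnce runs one GC cycle and records its duration, outcome,
// and reclaimed counts. Sandboxes reclaimed before a failure still count.
func instrumentedOnce(rec gcRecorder, once func() (gcResult, error)) error {
	start := time.Now()
	res, err := once()
	rec.ObserveCycle(time.Since(start).Seconds())
	if err != nil {
		rec.IncError()
	}
	if res.Expired > 0 {
		rec.AddReclaimed("expired", res.Expired)
	}
	if res.Inactive > 0 {
		rec.AddReclaimed("inactive", res.Inactive)
	}
	return err
}

// memRecorder is an in-memory recorder, useful in unit tests.
type memRecorder struct {
	cycles    int
	errors    int
	reclaimed map[string]int
}

func (r *memRecorder) ObserveCycle(float64) { r.cycles++ }

func (r *memRecorder) AddReclaimed(reason string, n int) {
	if r.reclaimed == nil {
		r.reclaimed = map[string]int{}
	}
	r.reclaimed[reason] += n
}

func (r *memRecorder) IncError() { r.errors++ }
```

Keeping the `reason` label to the two fixed values keeps cardinality bounded, and the in-memory recorder lets unit tests assert on emission without scraping an endpoint.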

### 3. Request Proxy (Router)

These metrics cover the reverse-proxy path that routes user traffic into sandboxes:

- `agentcube_router_request_duration_seconds` - histogram by `kind`
- `agentcube_router_proxy_errors_total` - counter by error category (connection_refused / timeout / other)
- `agentcube_router_concurrent_requests` - gauge reflecting current concurrency against the configured limit
- `agentcube_router_session_create_total` - counter for sessions created via the router's implicit sandbox creation path

## Implementation Scope

The implementation is additive and self-contained. No existing behavior changes.

Note: The Workload Manager currently uses controller-runtime for CRD reconciliation, but explicitly disables its built-in metrics server (`BindAddress: "0"` in `cmd/workload-manager/main.go`). The Workload Manager and the Router each serve their API from their own Gin HTTP server. The proposed approach is to register a `/metrics` handler on these existing Gin servers (via `promhttp.Handler()`), keeping the architecture consistent rather than re-enabling a separate controller-runtime metrics port.

Changes:

- New `metrics.go` files in `pkg/workloadmanager/` and `pkg/router/` for metric definitions and registration
- Instrumentation calls added at existing lifecycle points in `handlers.go`, `garbage_collection.go`, and router `handlers.go`
- A `/metrics` HTTP handler registered on each Gin server alongside existing health endpoints
- Prometheus scrape annotations added to the Router and Workload Manager Service manifests in the Helm chart (matching the existing pattern in `volcano-agent-scheduler-development.yaml`)
- Unit tests verifying metric emission for key code paths

## Why This Matters

This is foundational infrastructure for three immediate needs:

1. **Operational visibility** - operators can build Grafana dashboards and set alerts on sandbox creation failure rates, GC health, and proxy latency without log parsing
2. **SLO definition** - teams deploying AgentCube can define SLOs around creation latency (for example, p99 < 30s) and availability, backed by real metrics
3. **Capacity planning** - concurrent request gauges and GC reclamation rates provide the data needed to right-size deployments

It is also a prerequisite for future work such as autoscaling warm pools based on creation latency, or triggering alerts when GC falls behind.

## Suggested Metrics Conventions

- All metric names prefixed with `agentcube_` to avoid collisions
- Labels follow Prometheus naming conventions (`snake_case`, low cardinality)
- Histogram buckets aligned with expected latency ranges (sub-second for store ops, multi-second for sandbox creation)
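
To make the bucket guidance concrete, here is a small sketch with two candidate layouts and a helper showing which bucket an observation falls into. The boundary values are assumptions rather than measured numbers, and `bucketFor` is a hypothetical helper that mirrors the upper-inclusive (`le`) semantics of Prometheus histogram buckets.

```go
package main

import "sort"

// Assumed sub-second layout for store operations (seconds).
var storeOpBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1}

// Assumed multi-second layout for sandbox creation (seconds).
var sandboxCreateBuckets = []float64{0.5, 1, 2, 5, 10, 20, 30, 60}

// bucketFor returns the smallest upper bound that is >= seconds.
// ok is false when the observation only fits the implicit +Inf bucket.
// buckets must be sorted in ascending order.
func bucketFor(buckets []float64, seconds float64) (le float64, ok bool) {
	i := sort.SearchFloat64s(buckets, seconds)
	if i == len(buckets) {
		return 0, false
	}
	return buckets[i], true
}
```

Placing a boundary exactly at a latency target (30 seconds in the creation layout) lets the share of requests within that target be read directly from the bucket counters instead of being interpolated.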

