Background
The v0.1.0 release blog (docs/agentcube/blog/release-v0.1.0/index.md) lists under "Features and Enhancements":
Prometheus metrics: metrics exported by Router and Workload Manager for operational observability
The PicoD design proposal (docs/design/picod-proposal.md) also mentions:
Prometheus-compatible endpoint for monitoring and observability.
And the agent-scheduler Helm chart already has Prometheus scrape annotations configured.
However, no Prometheus metrics are actually implemented in the Router or Workload Manager Go code. Both servers rely exclusively on klog statements for operational visibility, meaning there is no way to track sandbox creation latency, failure rates, GC behavior, or proxy performance through standard monitoring tooling.
Proposal
Implement the Prometheus /metrics endpoints for both the Workload Manager and Router servers, fulfilling the feature listed in the v0.1.0 release notes. The implementation would use prometheus/client_golang (already a transitive dependency via client-go).
The metrics should cover the three operational surfaces that matter most for AgentCube operators:
1. Sandbox Lifecycle (Workload Manager)
These metrics capture the full create/delete lifecycle that the Workload Manager orchestrates:
- agentcube_sandbox_create_duration_seconds - histogram, labeled by kind (AgentRuntime / CodeInterpreter) and status (success / failure), covering the full end-to-end creation path (store placeholder, K8s resource creation, readiness wait, entrypoint probe)
- agentcube_sandbox_create_total - counter by kind and status
- agentcube_sandbox_delete_total - counter by kind
- agentcube_sandbox_rollback_total - counter, tracks how often creation rollbacks are triggered
2. Garbage Collection (Workload Manager)
These metrics give visibility into the GC loop that reclaims idle and expired sandboxes:
- agentcube_gc_cycle_duration_seconds - histogram for each once() cycle
- agentcube_gc_sandboxes_reclaimed_total - counter by reason (expired / inactive)
- agentcube_gc_errors_total - counter for GC-cycle failures
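One possible way to wire the GC metrics above into the loop without coupling it to a metrics library is a thin wrapper around each cycle. The sketch below uses only the standard library. The gcRecorder interface, memRecorder, instrumentCycle, and the assumed shape of once() (returning reclaimed counts keyed by reason) are all hypothetical; the real once() signature may differ.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// gcRecorder is a hypothetical seam between the GC loop and the metrics
// backend; a real implementation would forward to Prometheus collectors.
type gcRecorder interface {
	ObserveCycle(d time.Duration)      // agentcube_gc_cycle_duration_seconds
	AddReclaimed(reason string, n int) // agentcube_gc_sandboxes_reclaimed_total
	IncError()                         // agentcube_gc_errors_total
}

// memRecorder is an in-memory recorder used here only for demonstration.
type memRecorder struct {
	cycles    int
	reclaimed map[string]int
	failures  int
}

func newMemRecorder() *memRecorder {
	return &memRecorder{reclaimed: map[string]int{}}
}

func (r *memRecorder) ObserveCycle(time.Duration)        { r.cycles++ }
func (r *memRecorder) AddReclaimed(reason string, n int) { r.reclaimed[reason] += n }
func (r *memRecorder) IncError()                         { r.failures++ }

// errStoreUnavailable is a sample failure used by the demonstration.
var errStoreUnavailable = errors.New("store unavailable")

// instrumentCycle runs one GC pass and records its outcome.
// The duration is recorded for failed cycles as well as successful ones.
func instrumentCycle(rec gcRecorder, once func() (map[string]int, error)) error {
	start := time.Now()
	reclaimed, err := once()
	rec.ObserveCycle(time.Since(start))
	for reason, n := range reclaimed {
		rec.AddReclaimed(reason, n)
	}
	if err != nil {
		rec.IncError()
	}
	return err
}

func main() {
	rec := newMemRecorder()
	_ = instrumentCycle(rec, func() (map[string]int, error) {
		return map[string]int{"expired": 2, "inactive": 1}, nil
	})
	_ = instrumentCycle(rec, func() (map[string]int, error) {
		return nil, errStoreUnavailable
	})
	fmt.Printf("cycles=%d reclaimed=%v failures=%d\n", rec.cycles, rec.reclaimed, rec.failures)
}
```

Keeping the recorder behind an interface also makes the "unit tests verifying metric emission" item straightforward, since a test can assert on an in-memory recorder directly.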
3. Request Proxy (Router)
These metrics cover the reverse-proxy path that routes user traffic into sandboxes:
- agentcube_router_request_duration_seconds - histogram by kind
- agentcube_router_proxy_errors_total - counter by error category (connection_refused / timeout / other)
- agentcube_router_concurrent_requests - gauge reflecting current concurrency against the configured limit
- agentcube_router_session_create_total - counter for sessions created via the router's implicit sandbox creation path
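The error-category label on agentcube_router_proxy_errors_total needs a small, fixed set of values to stay low cardinality. A minimal sketch of how proxy errors could be bucketed into the three proposed categories, using only the standard library, is shown below. The classifyProxyError function and the examples table are hypothetical names introduced for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// classifyProxyError maps a proxy error to one of the label values
// proposed for agentcube_router_proxy_errors_total.
func classifyProxyError(err error) string {
	var netErr net.Error
	switch {
	case errors.Is(err, syscall.ECONNREFUSED):
		return "connection_refused"
	case errors.Is(err, context.DeadlineExceeded),
		errors.As(err, &netErr) && netErr.Timeout():
		return "timeout"
	default:
		return "other"
	}
}

// examples pairs sample errors with the category they should map to.
var examples = []struct {
	err  error
	want string
}{
	{
		err: &net.OpError{
			Op:  "dial",
			Net: "tcp",
			Err: os.NewSyscallError("connect", syscall.ECONNREFUSED),
		},
		want: "connection_refused",
	},
	{err: fmt.Errorf("proxy: %w", context.DeadlineExceeded), want: "timeout"},
	{err: os.ErrDeadlineExceeded, want: "timeout"},
	{err: errors.New("unexpected EOF"), want: "other"},
}

func main() {
	for _, ex := range examples {
		fmt.Printf("%-20s <- %v\n", classifyProxyError(ex.err), ex.err)
	}
}
```

The returned string would be passed as the label value when incrementing the counter, so any error type not explicitly recognized falls into "other" rather than creating a new series.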
Implementation Scope
The implementation is additive and self-contained. No existing behavior changes.
Note: The Workload Manager currently uses controller-runtime for CRD reconciliation, but explicitly disables its built-in metrics server (BindAddress: "0" in cmd/workload-manager/main.go). Both servers use their own Gin HTTP servers for the API. The proposed approach is to register a /metrics handler on the existing Gin servers (via promhttp.Handler()), keeping the architecture consistent rather than re-enabling a separate controller-runtime metrics port.
Changes:
- New metrics.go files in pkg/workloadmanager/ and pkg/router/ for metric definitions and registration
- Instrumentation calls added at existing lifecycle points in handlers.go, garbage_collection.go, and router handlers.go
- A /metrics HTTP handler registered on each Gin server alongside existing health endpoints
- Prometheus scrape annotations added to the Router and Workload Manager Service manifests in the Helm chart (matching the existing pattern in volcano-agent-scheduler-development.yaml)
- Unit tests verifying metric emission for key code paths
Why This Matters
This is foundational infrastructure for three immediate needs:
- Operational visibility - operators can build Grafana dashboards and set alerts on sandbox creation failure rates, GC health, and proxy latency without log parsing
- SLO definition - teams deploying AgentCube can define SLOs around creation latency (for example, p99 < 30s) and availability, backed by real metrics
- Capacity planning - concurrent request gauges and GC reclamation rates provide the data needed to right-size deployments
It also becomes a prerequisite for future work like autoscaling warm pools based on creation latency, or triggering alerts when GC falls behind.
Suggested Metrics Conventions
- All metric names prefixed with agentcube_ to avoid collisions
- Labels follow Prometheus naming conventions (snake_case, low cardinality)
- Histogram buckets aligned with expected latency ranges (sub-second for store ops, multi-second for sandbox creation)