[Proposal] Add Prometheus Metrics for Workload Manager and Router Observability #333

@Abhinav-kodes

Background

The v0.1.0 release blog (docs/agentcube/blog/release-v0.1.0/index.md) lists under "Features and Enhancements":

Prometheus metrics: metrics exported by Router and Workload Manager for operational observability

The PicoD design proposal (docs/design/picod-proposal.md) also mentions:

Prometheus-compatible endpoint for monitoring and observability.

And the agent-scheduler Helm chart already has Prometheus scrape annotations configured.

However, no Prometheus metrics are actually implemented in the Router or Workload Manager Go code. Both servers rely exclusively on klog statements for operational visibility, meaning there is no way to track sandbox creation latency, failure rates, GC behavior, or proxy performance through standard monitoring tooling.

Proposal

Implement Prometheus /metrics endpoints for both the Workload Manager and Router servers, fulfilling the feature listed in the v0.1.0 release notes. The implementation would use prometheus/client_golang (already available as a transitive dependency, for example via controller-runtime in the Workload Manager).

The metrics should cover the three operational surfaces that matter most for AgentCube operators:

1. Sandbox Lifecycle (Workload Manager)

These metrics capture the full create/delete lifecycle that the Workload Manager orchestrates:

  • agentcube_sandbox_create_duration_seconds - histogram, labeled by kind (AgentRuntime / CodeInterpreter) and status (success / failure), covering the full end-to-end creation path (store placeholder, K8s resource creation, readiness wait, entrypoint probe)
  • agentcube_sandbox_create_total - counter by kind and status
  • agentcube_sandbox_delete_total - counter by kind
  • agentcube_sandbox_rollback_total - counter, tracks how often creation rollbacks are triggered
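The timing and labeling pattern behind the create metrics can be sketched as follows. This is only an illustration under stated assumptions: `recorder`, `observeCreate`, and `errDemo` are hypothetical names, and the in-memory recorder stands in for the labeled histogram and counter that prometheus/client_golang would provide. It shows one point the list above implies: the duration and the status label are both derived from a single wrapper around the whole creation path, so the histogram and the counter cannot disagree.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errDemo is a hypothetical failure used only by the usage example.
var errDemo = errors.New("readiness wait timed out")

// observation is one recorded sample: label values plus the measured duration.
type observation struct {
	Kind    string
	Status  string
	Seconds float64
}

// recorder is a hypothetical in-memory stand-in for a labeled histogram
// (Samples) and a labeled counter (Totals, keyed by "kind/status").
type recorder struct {
	mu      sync.Mutex
	Samples []observation
	Totals  map[string]int
}

func newRecorder() *recorder {
	return &recorder{Totals: make(map[string]int)}
}

// observeCreate runs create, measures the elapsed wall-clock time, and records
// one sample labeled by kind and by whether create returned an error.
func (r *recorder) observeCreate(kind string, create func() error) error {
	start := time.Now()
	err := create()

	status := "success"
	if err != nil {
		status = "failure"
	}

	r.mu.Lock()
	defer r.mu.Unlock()
	r.Samples = append(r.Samples, observation{
		Kind:    kind,
		Status:  status,
		Seconds: time.Since(start).Seconds(),
	})
	r.Totals[kind+"/"+status]++
	return err
}

func main() {
	r := newRecorder()
	_ = r.observeCreate("CodeInterpreter", func() error { return nil })
	_ = r.observeCreate("AgentRuntime", func() error { return errDemo })
	fmt.Println(r.Totals)
}
```

In a real implementation the body of `observeCreate` would call the histogram's and counter's own observe and increment methods instead of appending to slices and maps; the wrapper shape stays the same.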

2. Garbage Collection (Workload Manager)

These metrics give visibility into the GC loop that reclaims idle and expired sandboxes:

  • agentcube_gc_cycle_duration_seconds - histogram for each once() cycle
  • agentcube_gc_sandboxes_reclaimed_total - counter by reason (expired / inactive)
  • agentcube_gc_errors_total - counter for GC-cycle failures
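A rough sketch of how one GC cycle could feed these three metrics is shown below. Everything in it is an assumption made for illustration: the `sandbox` fields, the `reclaimReason` rule (expiry takes precedence over inactivity), and the `gcStats` type, which is an in-memory stand-in for the real histogram and counters. It also assumes that a cycle counts as one failure if any reclaim in it fails, which is one possible reading of "GC-cycle failures".

```go
package main

import (
	"fmt"
	"time"
)

// sandbox holds only the fields this sketch needs; both are hypothetical.
type sandbox struct {
	ExpiresAt    time.Time
	LastActivity time.Time
}

// reclaimReason returns the reason label for a sandbox that should be reclaimed
// at time now, or "" if it should be kept. Expiry wins over inactivity so that
// each reclaimed sandbox is counted under exactly one reason.
func reclaimReason(s sandbox, now time.Time, idleTimeout time.Duration) string {
	switch {
	case !now.Before(s.ExpiresAt):
		return "expired"
	case now.Sub(s.LastActivity) >= idleTimeout:
		return "inactive"
	default:
		return ""
	}
}

// gcStats is a hypothetical in-memory stand-in for the three GC metrics.
type gcStats struct {
	CycleSeconds []float64      // one entry per cycle (histogram stand-in)
	Reclaimed    map[string]int // keyed by reason
	Errors       int            // cycles in which at least one reclaim failed
}

// runCycle performs one GC pass and records its duration, reclaims, and failure.
func (g *gcStats) runCycle(sandboxes []sandbox, now time.Time, idleTimeout time.Duration, reclaim func(sandbox) error) {
	start := time.Now()
	failed := false
	for _, s := range sandboxes {
		reason := reclaimReason(s, now, idleTimeout)
		if reason == "" {
			continue
		}
		if err := reclaim(s); err != nil {
			failed = true
			continue
		}
		g.Reclaimed[reason]++
	}
	if failed {
		g.Errors++
	}
	g.CycleSeconds = append(g.CycleSeconds, time.Since(start).Seconds())
}

// demoCycle runs one cycle over an expired, an idle, and an active sandbox.
func demoCycle() *gcStats {
	now := time.Unix(1000000, 0)
	sandboxes := []sandbox{
		{ExpiresAt: now.Add(-time.Minute), LastActivity: now},
		{ExpiresAt: now.Add(time.Hour), LastActivity: now.Add(-20 * time.Minute)},
		{ExpiresAt: now.Add(time.Hour), LastActivity: now},
	}
	g := &gcStats{Reclaimed: make(map[string]int)}
	g.runCycle(sandboxes, now, 15*time.Minute, func(sandbox) error { return nil })
	return g
}

func main() {
	g := demoCycle()
	fmt.Println(g.Reclaimed, g.Errors, len(g.CycleSeconds))
}
```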

3. Request Proxy (Router)

These metrics cover the reverse-proxy path that routes user traffic into sandboxes:

  • agentcube_router_request_duration_seconds - histogram by kind
  • agentcube_router_proxy_errors_total - counter by error category (connection_refused / timeout / other)
  • agentcube_router_concurrent_requests - gauge reflecting current concurrency against the configured limit
  • agentcube_router_session_create_total - counter for sessions created via the router's implicit sandbox creation path
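The error-category label on agentcube_router_proxy_errors_total needs a small, bounded mapping from Go errors to the three values listed above. One possible mapping, using only the standard library, might look like the sketch below. The function name `classifyProxyError` and the sample errors are hypothetical, and the connection-refused check assumes a platform where the dial error wraps `syscall.ECONNREFUSED` (as on Unix-like systems).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// classifyProxyError maps an error from the proxy path to one of three
// low-cardinality label values: connection_refused, timeout, or other.
func classifyProxyError(err error) string {
	var netErr net.Error
	switch {
	case errors.Is(err, syscall.ECONNREFUSED):
		return "connection_refused"
	case errors.Is(err, context.DeadlineExceeded),
		errors.As(err, &netErr) && netErr.Timeout():
		return "timeout"
	default:
		return "other"
	}
}

// demoCategories classifies three hand-built sample errors.
func demoCategories() []string {
	refused := &net.OpError{
		Op:  "dial",
		Net: "tcp",
		Err: os.NewSyscallError("connect", syscall.ECONNREFUSED),
	}
	timedOut := fmt.Errorf("proxy request: %w", context.DeadlineExceeded)
	unknown := errors.New("unexpected EOF from upstream")

	return []string{
		classifyProxyError(refused),
		classifyProxyError(timedOut),
		classifyProxyError(unknown),
	}
}

func main() {
	fmt.Println(demoCategories())
}
```

Keeping the mapping in a single function makes it easy to guarantee that no raw error string ever becomes a label value, which would otherwise create unbounded label cardinality.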

Implementation Scope

The implementation is additive and self-contained. No existing behavior changes.

Note: The Workload Manager currently uses controller-runtime for CRD reconciliation but explicitly disables its built-in metrics server (BindAddress: "0" in cmd/workload-manager/main.go). Both the Workload Manager and the Router serve their APIs from their own Gin HTTP servers. The proposed approach is to register a /metrics handler on those existing Gin servers (via promhttp.Handler()), which keeps the architecture consistent and avoids re-enabling a separate controller-runtime metrics port.

Changes:

  • New metrics.go files in pkg/workloadmanager/ and pkg/router/ for metric definitions and registration
  • Instrumentation calls added at existing lifecycle points in the Workload Manager's handlers.go and garbage_collection.go, and in the Router's handlers.go
  • A /metrics HTTP handler registered on each Gin server alongside existing health endpoints
  • Prometheus scrape annotations added to the Router and Workload Manager Service manifests in the Helm chart (matching the existing pattern in volcano-agent-scheduler-development.yaml)
  • Unit tests verifying metric emission for key code paths

Why This Matters

This is foundational infrastructure for three immediate needs:

  1. Operational visibility - operators can build Grafana dashboards and set alerts on sandbox creation failure rates, GC health, and proxy latency without log parsing
  2. SLO definition - teams deploying AgentCube can define SLOs around creation latency (for example, p99 < 30s) and availability, backed by real metrics
  3. Capacity planning - concurrent request gauges and GC reclamation rates provide the data needed to right-size deployments

It also becomes a prerequisite for future work like autoscaling warm pools based on creation latency, or triggering alerts when GC falls behind.

Suggested Metrics Conventions

  • All metric names prefixed with agentcube_ to avoid collisions
  • Label names follow Prometheus naming conventions (snake_case), and label values are drawn from small fixed sets to keep cardinality low
  • Histogram buckets aligned with expected latency ranges (sub-second for store ops, multi-second for sandbox creation)
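As a sketch of what aligned buckets could mean for sandbox creation: Prometheus estimates quantiles by interpolating within buckets, so it helps to place a bucket bound exactly at any latency threshold that an SLO will reference, such as the 30s figure used as an example earlier. The helpers below are hypothetical and standard-library only. `exponentialBuckets` is written to follow the same start/factor/count scheme as client_golang's `prometheus.ExponentialBuckets`, and the concrete values (0.5s start, factor 2, 8 buckets) are assumptions, not measured numbers.

```go
package main

import (
	"fmt"
	"sort"
)

// exponentialBuckets returns count upper bounds, beginning at start and
// multiplying by factor each step.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	for i := 0; i < count; i++ {
		buckets = append(buckets, start)
		start *= factor
	}
	return buckets
}

// withBound returns a sorted copy of buckets that contains bound, inserting it
// at the correct position if it is not already present. buckets must be sorted.
func withBound(buckets []float64, bound float64) []float64 {
	i := sort.SearchFloat64s(buckets, bound)
	if i < len(buckets) && buckets[i] == bound {
		return buckets
	}
	out := make([]float64, 0, len(buckets)+1)
	out = append(out, buckets[:i]...)
	out = append(out, bound)
	return append(out, buckets[i:]...)
}

// createBuckets is a hypothetical multi-second layout for sandbox creation,
// with an explicit bound at a 30s SLO threshold.
func createBuckets() []float64 {
	return withBound(exponentialBuckets(0.5, 2, 8), 30)
}

func main() {
	fmt.Println(createBuckets())
}
```

Sub-second operations such as store calls would use a separate layout with a much smaller start value, since sharing one layout across both ranges wastes buckets at one end or the other.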
