Feature request: add /healthz and /readyz HTTP endpoints for liveness and readiness monitoring #4649

@chaitanyapantheor

Is your feature request related to a problem? Please describe.

There is currently no way for external systems (Kubernetes liveness/readiness probes, load balancers, monitoring dashboards) to check whether a gcsfuse mount is healthy without inspecting the process directly or attempting a filesystem operation. When gcsfuse is deployed at scale across many GCS buckets/tenants, operators have no lightweight signal for whether a given mount is alive and serving requests, or whether it has finished initializing and is ready to accept traffic.

Describe the solution you'd like

Add two HTTP endpoints to the existing Prometheus HTTP server (already started when --prometheus-port is set). No new port is needed.

GET http://localhost:{prometheus-port}/healthz   → liveness
GET http://localhost:{prometheus-port}/readyz    → readiness

/healthz — liveness

Answers: is the gcsfuse process and mount still alive?

  • 200 OK: the mount is active (mounting completed successfully and teardown has not begun)
  • 503 Service Unavailable: the mount is not yet up, or is shutting down

Determined by an atomic.Bool set to true after markSuccessfulMount() and false when teardown begins. Makes no GCS API calls.
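As a rough illustration of the liveness check described above, the following sketch maps an atomic.Bool to the two status codes. The healthState type, its live field, and the probeHealthz helper are hypothetical names chosen for this example; only atomic.Bool and the 200/503 semantics come from the proposal. It assumes Go 1.19 or later, where sync/atomic provides atomic.Bool.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// healthState is a hypothetical holder for the in-memory mount state.
// The zero value reports "not live", which matches "mount is not yet up".
type healthState struct {
	live atomic.Bool
}

// handleHealthz answers 200 while the flag is set and 503 otherwise.
// It reads only the in-memory flag and makes no GCS calls.
func (h *healthState) handleHealthz(w http.ResponseWriter, _ *http.Request) {
	if h.live.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

// probeHealthz issues an in-process GET /healthz and returns the status code.
func probeHealthz(h *healthState) int {
	rec := httptest.NewRecorder()
	h.handleHealthz(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	h := &healthState{}
	fmt.Println("before mount:", probeHealthz(h))

	h.live.Store(true) // would follow markSuccessfulMount() in the proposal
	fmt.Println("mounted:", probeHealthz(h))

	h.live.Store(false) // would happen when teardown begins
	fmt.Println("tearing down:", probeHealthz(h))
}
```

Because the handler only loads a boolean, it stays cheap enough to be probed frequently.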

/readyz — readiness

Answers: is the mount healthy enough to serve requests?

  • 200 OK — mount is live AND recent error rate is below threshold
  • 503 Service Unavailable — error rate exceeds threshold or mount is not live

Derived from the already-collected fs/ops_error_count OpenTelemetry metric, taken as a fraction of total filesystem operations. No new GCS calls are made; the check is entirely passive.
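The proposal does not say how a "recent" error rate would be obtained from cumulative counters. One possible approach, shown here purely as an assumption, is to compare two snapshots of the error and total-operation counters and divide the deltas. The opsSnapshot type and recentErrorRate function are hypothetical, and treating an idle window as a rate of 0 is a design choice of this sketch rather than something the issue specifies.

```go
package main

import "fmt"

// opsSnapshot is a hypothetical pair of cumulative counter readings
// (error count and total operation count) taken at one point in time.
type opsSnapshot struct {
	errors uint64
	total  uint64
}

// recentErrorRate returns errors/total for the operations that happened
// between prev and curr. It returns 0 when no operations were recorded in
// the window or when a counter appears to have been reset (assumption: an
// idle mount should not be reported as failing).
func recentErrorRate(prev, curr opsSnapshot) float64 {
	if curr.total <= prev.total || curr.errors < prev.errors {
		return 0
	}
	return float64(curr.errors-prev.errors) / float64(curr.total-prev.total)
}

func main() {
	prev := opsSnapshot{errors: 10, total: 1000}
	curr := opsSnapshot{errors: 35, total: 1100}
	// 25 new errors across 100 new operations in the window.
	fmt.Printf("recent error rate: %.2f\n", recentErrorRate(prev, curr))
}
```

Working on deltas keeps a burst of old errors from marking the mount unready indefinitely, at the cost of having to choose a snapshot interval.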

New flag: --health-check-error-rate-threshold

Controls the error rate (errors/total ops, as a float in [0.0, 1.0]) above which /readyz returns 503. Default: 0.05 (5%).

# config.yaml equivalent
metrics:
  health-check-error-rate-threshold: 0.05
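The threshold semantics can be sketched as two small functions: one that rejects values outside [0.0, 1.0], and one that turns liveness plus an error rate into the /readyz status. Both function names are hypothetical. The sketch assumes that a rate exactly equal to the threshold still counts as ready, since the flag is described as the rate "above which" 503 is returned.

```go
package main

import (
	"fmt"
	"math"
	"net/http"
)

// validateThreshold rejects values outside the documented [0.0, 1.0] range.
// NaN is rejected explicitly because it fails every ordered comparison.
func validateThreshold(t float64) error {
	if math.IsNaN(t) || t < 0.0 || t > 1.0 {
		return fmt.Errorf("health-check-error-rate-threshold must be in [0.0, 1.0], got %v", t)
	}
	return nil
}

// readinessStatus returns the HTTP status for /readyz: 503 when the mount
// is not live or the error rate is above the threshold, 200 otherwise.
func readinessStatus(live bool, errorRate, threshold float64) int {
	if !live || errorRate > threshold {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	const threshold = 0.05 // the proposed default (5%)
	if err := validateThreshold(threshold); err != nil {
		panic(err)
	}
	fmt.Println("live, 1% errors:", readinessStatus(true, 0.01, threshold))
	fmt.Println("live, 20% errors:", readinessStatus(true, 0.20, threshold))
	fmt.Println("not live, 0% errors:", readinessStatus(false, 0.00, threshold))
}
```

Keeping the decision in a pure function makes the boundary cases (0.0, 1.0, and a rate equal to the threshold) easy to unit test without an HTTP server.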

Both checks are entirely passive — they read from already-collected metrics and in-memory mount state. They make no GCS API calls, so they will not wake idle COS bindings or incur additional GCS operation costs.

Proposed implementation

| File | Change |
| --- | --- |
| internal/monitor/otelexporters.go | Register /healthz and /readyz routes on the existing http.ServeMux in serveMetrics(); accept a mountState func and error-rate reader |
| cmd/legacy_main.go | Wire mount lifecycle into the health state after markSuccessfulMount() and before mfs.Join() returns |
| cfg/params.yaml | Add health-check-error-rate-threshold under the metrics section |
| cfg/config.go | Add HealthCheckErrorRateThreshold float64 to MetricsConfig |
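To illustrate the route registration, here is a minimal sketch of a single http.ServeMux serving /metrics, /healthz, and /readyz together. It does not reproduce serveMetrics() from the repository: newMux, statusHandler, placeholderMetrics, and statusOf are hypothetical helpers, and the metrics handler is a stand-in for the real Prometheus handler.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusHandler returns 200 when ok() reports true and 503 otherwise.
func statusHandler(ok func() bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		if ok() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
}

// placeholderMetrics stands in for the existing Prometheus handler.
func placeholderMetrics() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "# placeholder metrics output")
	})
}

// newMux registers the health routes next to /metrics on one mux, so all
// three endpoints share the same port.
func newMux(metrics http.Handler, isLive, isReady func() bool) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metrics)
	mux.Handle("/healthz", statusHandler(isLive))
	mux.Handle("/readyz", statusHandler(isReady))
	return mux
}

// statusOf issues an in-process GET and returns the response status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	// A mount that is live but currently over its error-rate threshold.
	mux := newMux(placeholderMetrics(),
		func() bool { return true },
		func() bool { return false })
	for _, p := range []string{"/metrics", "/healthz", "/readyz"} {
		fmt.Println(p, statusOf(mux, p))
	}
}
```

Passing the state in as plain func() bool values keeps the monitoring package free of any dependency on the mount code, which is in the spirit of the "accept a mountState func" note in the table.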

Rough scope: ~80–120 lines of new code, no new dependencies.

Describe alternatives you've considered

  • Active GCS probe (e.g. HEAD request to the bucket root): Works but wakes COS bindings on idle mounts and adds per-check GCS cost. Rejected for scale deployments.
  • Filesystem stat of the mount point: Less reliable — returns stale data if the kernel has cached the inode. Also ties health to local kernel state rather than GCS connectivity metrics.
  • Separate health port: Adds operational complexity. Reusing --prometheus-port keeps the configuration surface minimal and avoids an additional open port per mount.
  • Single combined endpoint: Separating liveness (/healthz) from readiness (/readyz) follows Kubernetes convention and allows probes to be configured independently — e.g. a longer failure threshold for readiness without restarting the container.

Additional context

The Prometheus HTTP server is already present in internal/monitor/otelexporters.go (lines 167–194) and runs whenever --prometheus-port > 0. This change adds two routes to the same server and does not affect the existing /metrics endpoint.

Multi-tenant aggregation of health across gcsfuse instances (scraping all per-mount endpoints into a single Grafana view) is out of scope for this issue.
