Feature request: add /healthz and /readyz HTTP endpoints for liveness and readiness monitoring

**Is your feature request related to a problem? Please describe.**

There is currently no way for external systems (Kubernetes liveness/readiness probes, load balancers, monitoring dashboards) to check whether a gcsfuse mount is healthy without inspecting the process directly or attempting a filesystem operation. When gcsfuse is deployed at scale across many GCS buckets/tenants, operators have no lightweight signal for whether a given mount is alive and serving requests, or whether it has finished initializing and is ready to accept traffic.

**Describe the solution you'd like**

Add two HTTP endpoints to the existing Prometheus HTTP server (already started when `--prometheus-port` is set). No new port is needed.

```
GET http://localhost:{prometheus-port}/healthz   → liveness
GET http://localhost:{prometheus-port}/readyz    → readiness
```

**`/healthz` — liveness**

Answers: _is the gcsfuse process and mount still alive?_

- `200 OK` — mount is active (the mount completed successfully and has not yet been torn down)
- `503 Service Unavailable` — mount is not yet up, or is shutting down

Determined by an `atomic.Bool` set to `true` after `markSuccessfulMount()` and `false` when teardown begins. Makes **no GCS API calls**.

**`/readyz` — readiness**

Answers: _is the mount healthy enough to serve requests?_

- `200 OK` — mount is live AND recent error rate is below threshold
- `503 Service Unavailable` — error rate exceeds threshold or mount is not live

Derived from the already-collected `fs/ops_error_count` OpenTelemetry metric, taken as a fraction of total operations over the same period. The check is entirely passive and makes no new GCS calls.

**New flag: `--health-check-error-rate-threshold`**

Controls the error rate (errors/total ops, as a float in `[0.0, 1.0]`) above which `/readyz` returns `503`. Default: `0.05` (5%).

```yaml
# config.yaml equivalent
metrics:
  health-check-error-rate-threshold: 0.05
```

Both checks are **entirely passive** — they read from already-collected metrics and in-memory mount state. They make **no GCS API calls**, so they will not wake idle COS bindings or incur additional GCS operation costs.

**Proposed implementation**

| File | Change |
|---|---|
| `internal/monitor/otelexporters.go` | Register `/healthz` and `/readyz` routes on the existing `http.ServeMux` in `serveMetrics()`; accept a `mountState` func and error-rate reader |
| `cmd/legacy_main.go` | Wire mount lifecycle into the health state after `markSuccessfulMount()` and before `mfs.Join()` returns |
| `cfg/params.yaml` | Add `health-check-error-rate-threshold` under the `metrics` section |
| `cfg/config.go` | Add `HealthCheckErrorRateThreshold float64` to `MetricsConfig` |

Rough scope: ~80–120 lines of new code, no new dependencies.

**Describe alternatives you've considered**

- **Active GCS probe** (e.g. `HEAD` request to the bucket root): Works but wakes COS bindings on idle mounts and adds per-check GCS cost. Rejected for scale deployments.
- **Filesystem `stat` of the mount point**: Less reliable — returns stale data if the kernel has cached the inode. Also ties health to local kernel state rather than GCS connectivity metrics.
- **Separate health port**: Adds operational complexity. Reusing `--prometheus-port` keeps the configuration surface minimal and avoids an additional open port per mount.
- **Single combined endpoint**: Separating liveness (`/healthz`) from readiness (`/readyz`) follows Kubernetes convention and allows the probes to be configured independently. For example, a failing readiness probe takes the mount out of rotation without restarting the container, whereas a failing liveness probe triggers a restart.

**Additional context**

The Prometheus HTTP server is already present in `internal/monitor/otelexporters.go` (lines 167–194) and runs whenever `--prometheus-port > 0`. This change adds two routes to the same server with no impact to the existing `/metrics` endpoint.

Multi-tenant aggregation of health across gcsfuse instances (scraping all per-mount endpoints into a single Grafana view) is out of scope for this issue.
