Is your feature request related to a problem? Please describe.
There is currently no way for external systems (Kubernetes liveness/readiness probes, load balancers, monitoring dashboards) to check whether a gcsfuse mount is healthy without inspecting the process directly or attempting a filesystem operation. When gcsfuse is deployed at scale across many GCS buckets/tenants, operators have no lightweight signal for whether a given mount is alive and serving requests, or whether it has finished initializing and is ready to accept traffic.
Describe the solution you'd like
Add two HTTP endpoints to the existing Prometheus HTTP server (already started when --prometheus-port is set). No new port is needed.
GET http://localhost:{prometheus-port}/healthz → liveness
GET http://localhost:{prometheus-port}/readyz → readiness
/healthz — liveness
Answers: is the gcsfuse process and mount still alive?
200 OK: mount is active (the mount has completed successfully and teardown has not begun)
503 Service Unavailable: mount is not yet up, or is shutting down
Determined by an atomic.Bool set to true after markSuccessfulMount() and false when teardown begins. Makes no GCS API calls.
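A minimal sketch of the liveness check described above, assuming a hypothetical MountHealth type that wraps the atomic.Bool. The type, its method names, and the probe helper are illustrative only and are not existing gcsfuse code; the in-memory probe stands in for an HTTP client so the example runs without opening a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// MountHealth is a hypothetical holder for the in-memory mount state.
type MountHealth struct {
	live atomic.Bool // zero value is false: not yet mounted
}

// SetLive would be called with true after markSuccessfulMount() and with
// false when teardown begins (the exact call sites are assumptions).
func (m *MountHealth) SetLive(v bool) { m.live.Store(v) }

// Healthz returns 200 while the mount is live and 503 otherwise.
// It only reads the flag and makes no GCS API calls.
func (m *MountHealth) Healthz() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		if !m.live.Load() {
			http.Error(w, "mount not live", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	})
}

// probe sends an in-memory GET to h and returns the response status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	m := &MountHealth{}
	mux := http.NewServeMux()
	mux.Handle("/healthz", m.Healthz())

	fmt.Println(probe(mux, "/healthz")) // before the mount succeeds
	m.SetLive(true)
	fmt.Println(probe(mux, "/healthz")) // mount is active
	m.SetLive(false)
	fmt.Println(probe(mux, "/healthz")) // teardown has begun
}
```

Because the handler only loads a boolean, it stays cheap enough to be polled frequently by a liveness probe.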
/readyz — readiness
Answers: is the mount healthy enough to serve requests?
200 OK: mount is live AND recent error rate is at or below the threshold
503 Service Unavailable: recent error rate exceeds the threshold, or mount is not live
Derived from the already-collected fs/ops_error_count OpenTelemetry metric. No new GCS calls — entirely passive.
New flag: --health-check-error-rate-threshold
Controls the error rate (errors/total ops, as a float in [0.0, 1.0]) above which /readyz returns 503. Default: 0.05 (5%).
# config.yaml equivalent
metrics:
  health-check-error-rate-threshold: 0.05
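The readiness decision can be sketched as a pure function. This is only an illustration under stated assumptions: readyzStatus is a hypothetical name, the error and total counts are taken as inputs because the issue does not specify how the "recent" window is derived from fs/ops_error_count, and treating a mount with zero operations as ready is a choice made here, not something the proposal states.

```go
package main

import (
	"fmt"
	"net/http"
)

// readyzStatus returns the HTTP status that /readyz would send.
// errs and total are operation counts over some recent window; how that
// window is computed from the collected metrics is left open (assumption).
func readyzStatus(live bool, errs, total uint64, threshold float64) int {
	if !live {
		return http.StatusServiceUnavailable
	}
	if total == 0 {
		// Assumption: an idle mount with no recorded ops counts as ready.
		return http.StatusOK
	}
	if float64(errs)/float64(total) > threshold {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	const threshold = 0.05 // proposed default (5%)

	fmt.Println(readyzStatus(true, 2, 100, threshold))  // 2% error rate
	fmt.Println(readyzStatus(true, 10, 100, threshold)) // 10% error rate
	fmt.Println(readyzStatus(false, 0, 100, threshold)) // mount not live
	fmt.Println(readyzStatus(true, 0, 0, threshold))    // no ops recorded
}
```

Keeping the decision separate from the HTTP handler makes the threshold behaviour easy to unit test without a running mount.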
Both checks are entirely passive — they read from already-collected metrics and in-memory mount state. They make no GCS API calls, so they will not wake idle COS bindings or incur additional GCS operation costs.
Proposed implementation
| File | Change |
| --- | --- |
| internal/monitor/otelexporters.go | Register /healthz and /readyz routes on the existing http.ServeMux in serveMetrics(); accept a mountState func and error-rate reader |
| cmd/legacy_main.go | Wire mount lifecycle into the health state after markSuccessfulMount() and before mfs.Join() returns |
| cfg/params.yaml | Add health-check-error-rate-threshold under the metrics section |
| cfg/config.go | Add HealthCheckErrorRateThreshold float64 to MetricsConfig |
Rough scope: ~80–120 lines of new code, no new dependencies.
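Since the threshold is defined as a float in [0.0, 1.0], the config change would likely want a range check. The following is a rough sketch, not the actual gcsfuse validation code: the MetricsConfig here contains only the proposed field, and validateThreshold is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// MetricsConfig mirrors only the field proposed in this issue; the real
// struct in cfg/config.go has other fields that are omitted here.
type MetricsConfig struct {
	HealthCheckErrorRateThreshold float64
}

// validateThreshold is a hypothetical helper that rejects values outside
// the documented [0.0, 1.0] range, including NaN.
func validateThreshold(c MetricsConfig) error {
	t := c.HealthCheckErrorRateThreshold
	if math.IsNaN(t) || t < 0 || t > 1 {
		return fmt.Errorf("health-check-error-rate-threshold must be in [0.0, 1.0], got %v", t)
	}
	return nil
}

func main() {
	for _, t := range []float64{0.05, 0, 1, 1.5, -0.1} {
		err := validateThreshold(MetricsConfig{HealthCheckErrorRateThreshold: t})
		fmt.Printf("%v valid=%t\n", t, err == nil)
	}
}
```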
Describe alternatives you've considered
- Active GCS probe (e.g. HEAD request to the bucket root): Works but wakes COS bindings on idle mounts and adds per-check GCS cost. Rejected for scale deployments.
- Filesystem stat of the mount point: Less reliable, since it returns stale data if the kernel has cached the inode. Also ties health to local kernel state rather than GCS connectivity metrics.
- Separate health port: Adds operational complexity. Reusing --prometheus-port keeps the configuration surface minimal and avoids an additional open port per mount.
- Single combined endpoint: Separating liveness (/healthz) from readiness (/readyz) follows Kubernetes convention and allows probes to be configured independently, e.g. a longer failure threshold for readiness without restarting the container.
Additional context
The Prometheus HTTP server is already present in internal/monitor/otelexporters.go (lines 167–194) and runs whenever --prometheus-port > 0. This change adds two routes to the same server with no impact on the existing /metrics endpoint.
Multi-tenant aggregation of health across gcsfuse instances (scraping all per-mount endpoints into a single Grafana view) is out of scope for this issue.