
feat: Add health check endpoints for container orchestration#96

Open
Edition-X wants to merge 8 commits into clearml:main from Edition-X:feature/health-endpoints

Conversation

@Edition-X

Description

Adds health monitoring endpoints to ClearML Serving inference containers for use with Kubernetes, Docker Swarm, and other container orchestration platforms.

New Endpoints

| Endpoint | Purpose |
| --- | --- |
| `GET /health` | Basic health check: always 200 when the process is running |
| `GET /readiness` | Readiness probe: 503 until ModelRequestProcessor is initialized |
| `GET /liveness` | Lightweight liveness probe for orchestration keep-alive checks |
| `GET /health/metrics` | JSON service metrics: uptime, request count, loaded models, GPU memory |
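The /readiness semantics above (503 until the processor exists, but ready even with zero configured endpoints) can be sketched as a plain function. This is an illustrative sketch, not the PR's actual code; `StubProcessor` stands in for the real ModelRequestProcessor:

```python
class StubProcessor:
    """Stand-in for ModelRequestProcessor; only what the probe needs."""
    def __init__(self, endpoints=()):
        self._endpoints = list(endpoints)

    def get_loaded_endpoints(self):
        return self._endpoints


def readiness(processor):
    """Return (status_code, body) for GET /readiness.

    503 until the processor is initialized; a running processor with
    zero configured endpoints is still ready (models_loaded=0).
    """
    if processor is None:
        return 503, {"status": "not ready", "reason": "processor not initialized"}
    return 200, {"status": "ready",
                 "models_loaded": len(processor.get_loaded_endpoints())}
```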

Changes

clearml_serving/serving/main.py

  • Added the four health endpoints listed above
  • startup_time tracked at module load for uptime reporting
  • instance_id from setup_task() used as the service identifier in /health
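A minimal sketch of how the /health payload could combine the module-load timestamp with the service identifier. Names and the placeholder ID are illustrative, not the PR's exact code:

```python
import time

# Hypothetical module-level state, mirroring the described main.py changes.
startup_time = time.time()   # recorded once, at module load
instance_id = "task-1234"    # placeholder; the PR uses the ID from setup_task()


def health_payload():
    # Body for GET /health: reachable whenever the process is alive.
    return {
        "status": "ok",
        "instance_id": instance_id,
        "uptime_seconds": round(time.time() - startup_time, 3),
    }
```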

clearml_serving/serving/model_request_processor.py

  • Added _request_count (guarded by threading.Lock) and _last_prediction_time, incremented in process_request()
  • Added get_loaded_endpoints(), get_request_count(), get_last_prediction_time() accessors
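The lock-guarded tracking described here could look like the following (a later commit in this PR swaps the lock for a lock-free itertools.count; class and method names are illustrative):

```python
import threading
import time


class RequestTracker:
    """Lock-guarded request bookkeeping, as described for ModelRequestProcessor."""

    def __init__(self):
        self._lock = threading.Lock()
        self._request_count = 0
        self._last_prediction_time = None  # None until the first request

    def record_request(self):
        # Called from process_request(): bump the counter and stamp the time.
        with self._lock:
            self._request_count += 1
            self._last_prediction_time = time.time()

    def get_request_count(self):
        with self._lock:
            return self._request_count

    def get_last_prediction_time(self):
        with self._lock:
            return self._last_prediction_time
```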

Kubernetes Integration

livenessProbe:
  httpGet:
    path: /liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /readiness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

Docker Healthcheck

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

Closes #94

🤖 Generated with Claude Code

Edition-X and others added 8 commits March 14, 2026 15:11
- Added /health, /readiness, /liveness and /metrics endpoints for monitoring service status
- Implemented request tracking in ModelRequestProcessor to count requests and record last prediction time
- Added service instance ID and startup time tracking for monitoring
- Added GPU memory metrics collection using pynvml when available
- Enhanced readiness check to verify model loading status and GPU availability
- Added detailed metrics endpoint providing uptime, request count, loaded models, and GPU memory
- Fix _request_count race condition by guarding with threading.Lock
- /readiness no longer 503s a healthy service with no endpoints configured;
  a running processor with zero endpoints returns 200 with models_loaded=0
- Rename /metrics to /health/metrics to avoid Prometheus scraper confusion
- Remove unnecessary global declarations in read-only endpoint functions
- Simplify get_loaded_endpoints/get_request_count/get_last_prediction_time:
  remove dead hasattr/getattr guards since attributes are always set in __init__
- Remove dead AttributeError catch in metrics endpoint
- Remove redundant ModuleNotFoundError alongside ImportError catches
- Fix pynvml resource handling: separate ImportError from runtime errors,
  guard nvmlShutdown behind nvml_initialized flag so it is only called
  if nvmlInit succeeded, and catch all NVMLError variants via except Exception
- Replace redundant service_instance_id (uuid4) with existing instance_id
  returned by setup_task(), which is the ClearML inference task ID

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
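The pynvml lifecycle fixes in the bullets above (separate ImportError handling, nvmlShutdown only after a successful nvmlInit) can be sketched like this. It returns an empty dict on machines without pynvml or a GPU; an illustration, not the PR's exact code:

```python
def gpu_memory_metrics():
    """Best-effort per-GPU memory stats; {} when pynvml is unavailable or fails."""
    try:
        import pynvml
    except ImportError:
        return {}  # pynvml not installed: distinct from runtime NVML errors

    nvml_initialized = False
    try:
        pynvml.nvmlInit()
        nvml_initialized = True
        stats = {}
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            stats["gpu_%d" % i] = {"used": mem.used, "total": mem.total}
        return stats
    except Exception:
        # Catches all NVMLError variants (e.g. no driver, no device)
        return {}
    finally:
        # Only shut down NVML if nvmlInit actually succeeded
        if nvml_initialized:
            try:
                pynvml.nvmlShutdown()
            except Exception:
                pass
```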
Remove torch.cuda.is_available() from /readiness (not a project dependency,
irrelevant to readiness semantics). Replace threading.Lock request counter
with lock-free itertools.count matching existing FastWriteCounter pattern.
Fix README to reflect actual endpoint paths and JSON format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
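The lock-free counter pattern referenced above (itertools.count for writes, a small read-side lock that compensates for the ticks reads consume) is commonly implemented as below. This is a sketch of the general pattern, not clearml's exact FastWriteCounter:

```python
import itertools
import threading


class FastWriteCounter:
    """Writes never take a lock; next() on itertools.count is atomic in CPython."""

    def __init__(self):
        self._counter = itertools.count()
        self._reads = 0
        self._read_lock = threading.Lock()

    def increment(self):
        next(self._counter)  # lock-free write path

    def value(self):
        # Reading consumes one tick, so track and subtract the read count.
        with self._read_lock:
            value = next(self._counter) - self._reads
            self._reads += 1
        return value
```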


Development

Successfully merging this pull request may close these issues.

[Feature Request] Add Standard Health Check Endpoints to ClearML Serving
