
feat: Add health check endpoints for container orchestration#96

Open
Edition-X wants to merge 8 commits into clearml:main from Edition-X:feature/health-endpoints

Conversation

@Edition-X

Description

Adds health monitoring endpoints to ClearML Serving inference containers for use with Kubernetes, Docker Swarm, and other container orchestration platforms.

New Endpoints

| Endpoint | Purpose |
| --- | --- |
| `GET /health` | Basic health check: always 200 when the process is running |
| `GET /readiness` | Readiness probe: 503 until ModelRequestProcessor is initialized |
| `GET /liveness` | Lightweight liveness probe for orchestration keep-alive checks |
| `GET /health/metrics` | JSON service metrics: uptime, request count, loaded models, GPU memory |
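The /readiness semantics above (503 until the processor exists, but ready even with zero configured endpoints) can be sketched as a plain function. This is an illustrative sketch, not the PR's actual code; `StubProcessor` stands in for the real ModelRequestProcessor:

```python
class StubProcessor:
    """Stand-in for ModelRequestProcessor; only what the probe needs."""
    def __init__(self, endpoints=()):
        self._endpoints = list(endpoints)

    def get_loaded_endpoints(self):
        return self._endpoints


def readiness(processor):
    """Return (status_code, body) for GET /readiness.

    503 until the processor is initialized; a running processor with
    zero configured endpoints is still ready (models_loaded=0).
    """
    if processor is None:
        return 503, {"status": "not ready", "reason": "processor not initialized"}
    return 200, {"status": "ready",
                 "models_loaded": len(processor.get_loaded_endpoints())}
```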

Changes

clearml_serving/serving/main.py

  • Added the four health endpoints listed above
  • startup_time tracked at module load for uptime reporting
  • instance_id from setup_task() used as the service identifier in /health
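A minimal sketch of how the /health payload could combine the module-load timestamp with the service identifier. Names and the placeholder ID are illustrative, not the PR's exact code:

```python
import time

# Hypothetical module-level state, mirroring the described main.py changes.
startup_time = time.time()   # recorded once, at module load
instance_id = "task-1234"    # placeholder; the PR uses the ID from setup_task()


def health_payload():
    # Body for GET /health: reachable whenever the process is alive.
    return {
        "status": "ok",
        "instance_id": instance_id,
        "uptime_seconds": round(time.time() - startup_time, 3),
    }
```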

clearml_serving/serving/model_request_processor.py

  • Added _request_count (guarded by threading.Lock) and _last_prediction_time, incremented in process_request()
  • Added get_loaded_endpoints(), get_request_count(), get_last_prediction_time() accessors
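The lock-guarded tracking described here could look like the following (a later commit in this PR swaps the lock for a lock-free itertools.count; class and method names are illustrative):

```python
import threading
import time


class RequestTracker:
    """Lock-guarded request bookkeeping, as described for ModelRequestProcessor."""

    def __init__(self):
        self._lock = threading.Lock()
        self._request_count = 0
        self._last_prediction_time = None  # None until the first request

    def record_request(self):
        # Called from process_request(): bump the counter and stamp the time.
        with self._lock:
            self._request_count += 1
            self._last_prediction_time = time.time()

    def get_request_count(self):
        with self._lock:
            return self._request_count

    def get_last_prediction_time(self):
        with self._lock:
            return self._last_prediction_time
```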

Kubernetes Integration

livenessProbe:
  httpGet:
    path: /liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /readiness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

Docker Healthcheck

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

Closes #94

🤖 Generated with Claude Code

Edition-X and others added 8 commits March 14, 2026 15:11
- Added /health, /readiness, /liveness and /metrics endpoints for monitoring service status
- Implemented request tracking in ModelRequestProcessor to count requests and record last prediction time
- Added service instance ID and startup time tracking for monitoring
- Added GPU memory metrics collection using pynvml when available
- Enhanced readiness check to verify model loading status and GPU availability
- Added detailed metrics endpoint providing uptime, request count, loaded models, and GPU memory
- Fix _request_count race condition by guarding with threading.Lock
- /readiness no longer 503s a healthy service with no endpoints configured;
  a running processor with zero endpoints returns 200 with models_loaded=0
- Rename /metrics to /health/metrics to avoid Prometheus scraper confusion
- Remove unnecessary global declarations in read-only endpoint functions
- Simplify get_loaded_endpoints/get_request_count/get_last_prediction_time:
  remove dead hasattr/getattr guards since attributes are always set in __init__
- Remove dead AttributeError catch in metrics endpoint
- Remove redundant ModuleNotFoundError alongside ImportError catches
- Fix pynvml resource handling: separate ImportError from runtime errors,
  guard nvmlShutdown behind nvml_initialized flag so it is only called
  if nvmlInit succeeded, and catch all NVMLError variants via except Exception
- Replace redundant service_instance_id (uuid4) with existing instance_id
  returned by setup_task(), which is the ClearML inference task ID

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
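The pynvml lifecycle fixes in the bullets above (separate ImportError handling, nvmlShutdown only after a successful nvmlInit) can be sketched like this. It returns an empty dict on machines without pynvml or a GPU; an illustration, not the PR's exact code:

```python
def gpu_memory_metrics():
    """Best-effort per-GPU memory stats; {} when pynvml is unavailable or fails."""
    try:
        import pynvml
    except ImportError:
        return {}  # pynvml not installed: distinct from runtime NVML errors

    nvml_initialized = False
    try:
        pynvml.nvmlInit()
        nvml_initialized = True
        stats = {}
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            stats["gpu_%d" % i] = {"used": mem.used, "total": mem.total}
        return stats
    except Exception:
        # Catches all NVMLError variants (e.g. no driver, no device)
        return {}
    finally:
        # Only shut down NVML if nvmlInit actually succeeded
        if nvml_initialized:
            try:
                pynvml.nvmlShutdown()
            except Exception:
                pass
```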
Remove torch.cuda.is_available() from /readiness (not a project dependency,
irrelevant to readiness semantics). Replace threading.Lock request counter
with lock-free itertools.count matching existing FastWriteCounter pattern.
Fix README to reflect actual endpoint paths and JSON format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
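The lock-free counter pattern referenced above (itertools.count for writes, a small read-side lock that compensates for the ticks reads consume) is commonly implemented as below. This is a sketch of the general pattern, not clearml's exact FastWriteCounter:

```python
import itertools
import threading


class FastWriteCounter:
    """Writes never take a lock; next() on itertools.count is atomic in CPython."""

    def __init__(self):
        self._counter = itertools.count()
        self._reads = 0
        self._read_lock = threading.Lock()

    def increment(self):
        next(self._counter)  # lock-free write path

    def value(self):
        # Reading consumes one tick, so track and subtract the read count.
        with self._read_lock:
            value = next(self._counter) - self._reads
            self._reads += 1
        return value
```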


Development

Successfully merging this pull request may close these issues.

[Feature Request] Add Standard Health Check Endpoints to ClearML Serving
