feat: Add health check endpoints for container orchestration #96
Open
Edition-X wants to merge 8 commits into clearml:main from
- Added /health, /readiness, /liveness and /metrics endpoints for monitoring service status
- Implemented request tracking in ModelRequestProcessor to count requests and record last prediction time
- Added service instance ID and startup time tracking for monitoring
- Added GPU memory metrics collection using pynvml when available
- Enhanced readiness check to verify model loading status and GPU availability
- Added detailed metrics endpoint providing
- Fix _request_count race condition by guarding with threading.Lock
- /readiness no longer 503s a healthy service with no endpoints configured; a running processor with zero endpoints returns 200 with models_loaded=0
- Rename /metrics to /health/metrics to avoid Prometheus scraper confusion
- Remove unnecessary global declarations in read-only endpoint functions
- Simplify get_loaded_endpoints/get_request_count/get_last_prediction_time: remove dead hasattr/getattr guards since attributes are always set in __init__
- Remove dead AttributeError catch in metrics endpoint
- Remove redundant ModuleNotFoundError alongside ImportError catches
- Fix pynvml resource handling: separate ImportError from runtime errors, guard nvmlShutdown behind nvml_initialized flag so it is only called if nvmlInit succeeded, and catch all NVMLError variants via except Exception
- Replace redundant service_instance_id (uuid4) with existing instance_id returned by setup_task(), which is the ClearML inference task ID

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
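The pynvml handling described in this commit could look roughly like the sketch below. It is not the code from the PR: the function name `collect_gpu_memory` and the returned dictionary keys are hypothetical, chosen only to illustrate keeping the ImportError path separate from runtime errors and calling nvmlShutdown only after nvmlInit has succeeded.

```python
from typing import Dict, List, Optional


def collect_gpu_memory() -> Optional[List[Dict[str, int]]]:
    """Hypothetical helper: per-GPU memory stats.

    Returns None when pynvml is not installed, and an empty list when
    NVML fails at runtime (no driver, no GPU, and so on).
    """
    try:
        import pynvml  # optional dependency, may be absent
    except ImportError:
        return None

    nvml_initialized = False
    try:
        pynvml.nvmlInit()
        nvml_initialized = True
        stats = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            stats.append({
                "index": index,
                "total": int(memory.total),
                "used": int(memory.used),
                "free": int(memory.free),
            })
        return stats
    except Exception:
        # Covers every NVMLError variant as well as unexpected failures.
        return []
    finally:
        # Only shut down if nvmlInit() actually succeeded.
        if nvml_initialized:
            try:
                pynvml.nvmlShutdown()
            except Exception:
                pass
```

On a host without pynvml the helper returns None, so a caller can omit the GPU section of the metrics response instead of failing the request.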
Remove torch.cuda.is_available() from /readiness (not a project dependency, irrelevant to readiness semantics). Replace threading.Lock request counter with lock-free itertools.count matching existing FastWriteCounter pattern. Fix README to reflect actual endpoint paths and JSON format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
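A minimal sketch of a write-optimized counter built on itertools.count is shown below. It is not the project's FastWriteCounter; the class name `RequestCounter` is hypothetical, and the approach assumes CPython, where a single `next()` call on an `itertools.count` object is not interleaved with other threads under the GIL.

```python
import itertools
import threading


class RequestCounter:
    """Hypothetical counter: lock-free increments, locked (rare) reads."""

    def __init__(self) -> None:
        self._counter = itertools.count()
        self._number_of_reads = 0
        self._read_lock = threading.Lock()

    def increment(self) -> None:
        # Hot path: no lock, one C-level call (assumes CPython and the GIL).
        next(self._counter)

    def value(self) -> int:
        # Reading also advances the iterator, so subtract previous reads.
        with self._read_lock:
            current = next(self._counter) - self._number_of_reads
            self._number_of_reads += 1
        return current
```

The design trades a slightly more expensive read for a cheaper write, which fits a request counter that is incremented on every prediction and read only when a metrics endpoint is polled.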
Description
Adds health monitoring endpoints to ClearML Serving inference containers for use with Kubernetes, Docker Swarm, and other container orchestration platforms.
New Endpoints
- GET /health
- GET /readiness: ModelRequestProcessor is initialized
- GET /liveness
- GET /health/metrics
Changes
clearml_serving/serving/main.py
- startup_time tracked at module load for uptime reporting
- instance_id from setup_task() used as the service identifier in /health
clearml_serving/serving/model_request_processor.py
- _request_count (guarded by threading.Lock) and _last_prediction_time, incremented in process_request()
- get_loaded_endpoints(), get_request_count(), get_last_prediction_time() accessors
Kubernetes Integration
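A Kubernetes HTTP probe treats any status code from 200 up to (but not including) 400 as success, so what matters for a readinessProbe is when /readiness returns 200 versus 503. The sketch below illustrates the behavior described in the commits: an uninitialized processor yields 503, while a running processor with zero endpoints still yields 200 with models_loaded=0. The function `readiness`, the `StubProcessor` class, and the `status` field are hypothetical; only `get_loaded_endpoints()` and `models_loaded` are named in this PR, and the return type of the accessor is assumed to be a sized collection.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


class StubProcessor:
    """Hypothetical stand-in for ModelRequestProcessor."""

    def __init__(self, endpoints: Iterable[str]) -> None:
        self._endpoints = list(endpoints)

    def get_loaded_endpoints(self) -> List[str]:
        return list(self._endpoints)


def readiness(processor: Optional[StubProcessor]) -> Tuple[int, Dict[str, Any]]:
    """Return (HTTP status code, JSON body) for a readiness check."""
    if processor is None:
        # Processor not initialized yet: tell the orchestrator to wait.
        return 503, {"status": "not_ready", "models_loaded": 0}
    # A running processor is ready even with zero endpoints configured.
    loaded = processor.get_loaded_endpoints()
    return 200, {"status": "ready", "models_loaded": len(loaded)}
```

For example, `readiness(StubProcessor([]))` yields a 200 response, so a freshly deployed service with no models is not taken out of rotation.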
Docker Healthcheck
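Docker marks a container healthy when the HEALTHCHECK command exits with 0 and unhealthy when it exits with 1. A small standard-library script such as the sketch below could serve as that command in images that do not ship curl. The default URL, including port 8080, and the names `is_healthy` and `main` are assumptions for illustration, not taken from this PR.

```python
import urllib.error
import urllib.request
from typing import List

# Assumed default; adjust host and port to the actual serving container.
DEFAULT_URL = "http://localhost:8080/health"


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET on the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP 4xx/5xx, connection failures, timeouts, malformed URLs.
        return False


def main(argv: List[str]) -> int:
    """Return a Docker-style exit code: 0 healthy, 1 unhealthy."""
    url = argv[1] if len(argv) > 1 else DEFAULT_URL
    return 0 if is_healthy(url) else 1
```

A script entry point would call `sys.exit(main(sys.argv))`, so the exit code reaches Docker.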
Closes #94
🤖 Generated with Claude Code