fix(tracking): finalize MLflow runs on SIGTERM instead of leaving them RUNNING#1819
fix(tracking): finalize MLflow runs on SIGTERM instead of leaving them RUNNING#1819EazyReal wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces robust signal and exit handling to the Tracking class to ensure MLflow runs are properly finalized (e.g., marked as KILLED) during abnormal process termination, such as a SIGTERM from SLURM. It also adds corresponding unit tests to verify status forwarding, idempotency, and signal handling. The review feedback highlights a critical issue where registering bound methods directly with atexit and signal creates strong reference cycles, preventing garbage collection of the Tracking instance. To resolve this, the reviewer suggests using weakref wrappers, unregistering the handlers upon completion, and updating the tests and imports accordingly.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Good catch on the reference cycle. Fixed: the atexit hook and SIGTERM handler now reference |
10ad74d to
cdb878b
Compare
…m RUNNING MLflow runs were only finalized from Tracking.__del__, which Python does not guarantee to run on signal-driven termination (e.g. SLURM `scancel` sending SIGTERM), so affected runs stayed stuck in RUNNING with a null end_time. Thread a `status` through Tracking.finish() and the MLflow adapter so end_run() records a terminal status, register an atexit hook plus a SIGTERM handler that finalize the run (KILLED on signal), and make finish() idempotent so the overlapping cleanup paths only end the run once. The SIGTERM handler chains to the previous disposition so process teardown is preserved, and is skipped off the main thread (e.g. Ray workers) where signal.signal is unavailable. The hooks reference self via a weakref and are removed in finish(), so they don't keep the instance alive or turn __del__ into dead code. Fixes NovaSky-AI#1516 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cdb878b to
1066a2e
Compare
|
@SumanthRH @erictang000 Could I get a quick review on this tracking finalization fix? It closes MLflow runs on SIGTERM so interrupted jobs do not stay RUNNING forever. |
What
When a training process is killed by a signal (e.g. SLURM
scancel→ SIGTERM), its MLflow run is left instatus="RUNNING"withend_time=nullforever.Tracking.finish()(which callsmlflow.end_run()) is only invoked fromTracking.__del__, and Python does not guarantee__del__runs on signal-driven termination.Fix
Implements the approach suggested in the issue:
Tracking.finish(status="FINISHED")and the MLflow adapter forwardstatustoend_run.atexithook and aSIGTERMhandler that finalize the run (status="KILLED"on signal). The handler chains to the previous disposition so normal process teardown is preserved, and is skipped off the main thread (e.g. Ray workers) wheresignal.signalraises.finish()is idempotent, so thesignal/atexit/__del__paths end the run exactly once.CPU tests cover default/forwarded status, idempotency, and the SIGTERM path.
Fixes #1516