Skip to content

fix(tracking): finalize MLflow runs on SIGTERM instead of leaving them RUNNING#1819

Open
EazyReal wants to merge 1 commit into
NovaSky-AI:mainfrom
EazyReal:fix/mlflow-finish-on-sigterm
Open

fix(tracking): finalize MLflow runs on SIGTERM instead of leaving them RUNNING#1819
EazyReal wants to merge 1 commit into
NovaSky-AI:mainfrom
EazyReal:fix/mlflow-finish-on-sigterm

Conversation

@EazyReal

Copy link
Copy Markdown
Contributor

What

When a training process is killed by a signal (e.g. SLURM scancel → SIGTERM), its MLflow run is left in status="RUNNING" with end_time=null forever. Tracking.finish() (which calls mlflow.end_run()) is only invoked from Tracking.__del__, and Python does not guarantee __del__ runs on signal-driven termination.

Fix

Implements the approach suggested in the issue:

  • Tracking.finish(status="FINISHED") and the MLflow adapter forward status to end_run.
  • Register an atexit hook and a SIGTERM handler that finalize the run (status="KILLED" on signal). The handler chains to the previous disposition so normal process teardown is preserved, and is skipped off the main thread (e.g. Ray workers) where signal.signal raises.
  • finish() is idempotent, so the signal / atexit / __del__ paths end the run exactly once.

CPU tests cover default/forwarded status, idempotency, and the SIGTERM path.

Fixes #1516

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces robust signal and exit handling to the Tracking class to ensure MLflow runs are properly finalized (e.g., marked as KILLED) during abnormal process termination, such as a SIGTERM from SLURM. It also adds corresponding unit tests to verify status forwarding, idempotency, and signal handling. The review feedback highlights a critical issue where registering bound methods directly with atexit and signal creates strong reference cycles, preventing garbage collection of the Tracking instance. To resolve this, the reviewer suggests using weakref wrappers, unregistering the handlers upon completion, and updating the tests and imports accordingly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread skyrl/train/utils/tracking.py Outdated
Comment thread skyrl/train/utils/tracking.py
Comment thread skyrl/train/utils/tracking.py
Comment thread tests/train/test_tracking.py Outdated
@EazyReal

Copy link
Copy Markdown
Contributor Author

Good catch on the reference cycle. Fixed: the atexit hook and SIGTERM handler now reference self through a weakref (so the instance stays collectable and __del__ keeps working), and finish() unregisters the atexit hook and restores the previous SIGTERM handler. Added a test asserting the instance is GC'd after finish().

@EazyReal EazyReal force-pushed the fix/mlflow-finish-on-sigterm branch from 10ad74d to cdb878b Compare June 20, 2026 19:35
…m RUNNING

MLflow runs were only finalized from Tracking.__del__, which Python does not
guarantee to run on signal-driven termination (e.g. SLURM `scancel` sending
SIGTERM), so affected runs stayed stuck in RUNNING with a null end_time.

Thread a `status` through Tracking.finish() and the MLflow adapter so end_run()
records a terminal status, register an atexit hook plus a SIGTERM handler that
finalize the run (KILLED on signal), and make finish() idempotent so the
overlapping cleanup paths only end the run once. The SIGTERM handler chains to
the previous disposition so process teardown is preserved, and is skipped off the
main thread (e.g. Ray workers) where signal.signal is unavailable.

The hooks reference self via a weakref and are removed in finish(), so they don't
keep the instance alive or turn __del__ into dead code.

Fixes NovaSky-AI#1516

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@EazyReal EazyReal force-pushed the fix/mlflow-finish-on-sigterm branch from cdb878b to 1066a2e Compare June 30, 2026 08:41
@EazyReal

Copy link
Copy Markdown
Contributor Author

@SumanthRH @erictang000 Could I get a quick review on this tracking finalization fix? It closes MLflow runs on SIGTERM so interrupted jobs do not stay RUNNING forever.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MLflow runs stuck in RUNNING after SIGTERM

1 participant