Skip to content

[FIX] local backend trial cleanup#13

Open
VincentG1234 wants to merge 2 commits into
mainfrom
fix/local-backend-cleanup
Open

[FIX] local backend trial cleanup#13
VincentG1234 wants to merge 2 commits into
mainfrom
fix/local-backend-cleanup

Conversation

@VincentG1234

Copy link
Copy Markdown
Collaborator

Summary

  • improve local backend cleanup by tracking trial controllers alongside futures
  • request cooperative cancellation before forcing resource cleanup
  • wait briefly for running local trials to stop before clearing backend state

Why

The previous local cleanup path only cancelled futures, which is not enough when trial work is already running. This change makes shutdown and cleanup more reliable by propagating cancellation to the trial controller and attempting resource cleanup for still-running trials.

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
@VincentG1234

Copy link
Copy Markdown
Collaborator Author

@hgsmn tell me If that fix the bug you met

@VincentG1234 VincentG1234 changed the title Fix local backend trial cleanup [FIX] local backend trial cleanup Apr 27, 2026
@hgsmn

hgsmn commented Apr 27, 2026

Copy link
Copy Markdown
Collaborator

Yes it works for me. Stopping the process is pretty long because vllm needs to finish to process requests before stopping.

@hgsmn

hgsmn commented Apr 27, 2026

Copy link
Copy Markdown
Collaborator

#2

@hgsmn

hgsmn commented Apr 29, 2026

Copy link
Copy Markdown
Collaborator

In the cleanup_all_trials function, we can maybe add more comments and logs during the cleaning process to check the process that need to be forced stop.

Here is a proposal for the full function :

def cleanup_all_trials(self):
    """Force cleanup of all active local trials."""
    if not self.active_futures:
        logger.debug("No active local trials to cleanup")
        return

    logger.info(f"Cleaning up {len(self.active_futures)} active local trial(s)")

    running_jobs = [
        (job_id, future)
        for job_id, future in self.active_futures.items()
        if not future.done()
    ]

    # Phase 1 — Graceful cancellation request
    for job_id, _future in running_jobs:
        controller = self.active_controllers.get(job_id)
        if controller is None:
            logger.warning(
                f"Inconsistent state: future exists for trial {job_id} "
                "but no controller found — skipping cancellation"
            )
            continue

        try:
            controller.request_cancellation()
            logger.debug(f"Requested cancellation for local trial {job_id}")
        except Exception as e:
            logger.warning(
                f"Failed to request cancellation for local trial {job_id}: {e}"
            )

    # Phase 2 — Wait and force cleanup
    if running_jobs:
        futures_list = [future for _, future in running_jobs]

        logger.info(
            f"Waiting {self.CANCELLATION_DETECTION_WAIT}s for local trials "
            "to detect cancellation..."
        )
        done, not_done = concurrent.futures.wait(
            futures_list,
            timeout=self.CANCELLATION_DETECTION_WAIT,
        )
        logger.info(
            f"{len(done)} local trial(s) stopped after cancellation request, "
            f"{len(not_done)} still running"
        )

        for job_id, future in running_jobs:
            if future.done():
                continue

            controller = self.active_controllers.get(job_id)
            if controller is None:
                logger.warning(
                    f"Inconsistent state: future exists for trial {job_id} "
                    "but no controller found — skipping forced cleanup"
                )
                continue

            try:
                controller.cleanup_resources()
                logger.debug(f"Forced cleanup for local trial {job_id}")
            except Exception as e:
                logger.warning(
                    f"Failed forced cleanup for local trial {job_id}: {e}"
                )

        # Phase 3 — Last chance
        if any(not future.done() for _, future in running_jobs):
            concurrent.futures.wait(
                futures_list,
                timeout=self.GRACEFUL_CLEANUP_TIMEOUT,
            )

    # Finalization
    orphaned = [job_id for job_id, future in running_jobs if not future.done()]
    if orphaned:
        logger.warning(
            f"{len(orphaned)} trial(s) could not be stopped and were abandoned: "
            f"{orphaned}"
        )

    self.active_futures.clear()
    self.active_controllers.clear()
    logger.info("Completed cleanup of all active local trials")

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
VincentG1234 added a commit that referenced this pull request May 20, 2026
## Summary
Add structured documentation for human maintainers and coding agents:
`AGENTS.md`, `.ai/context/`, `.ai/skills/`, and a user-facing
architecture guide with Mermaid diagrams. Link the new doc from README
and quick start. Align the local example YAML with vLLM V1 constraints.

## Why
The InseeFrLab fork needs a stable onboarding path for contributors and
agents (local backend focus, safe commands, known issues/PRs) without
reading the whole codebase. Architecture was previously only implicit in
code; diagrams improve onboarding and reviews.

## What changed
- `AGENTS.md` — entry point for agents (context files, skills, safe
commands).
- `.ai/context/` — repo map, execution flow, history, known issues,
current work snapshot, external links.
- `.ai/skills/` — pr-writer, pr-reviewer, test-writer, docs-writer,
architecture-diagrams.
- `docs/architecture.md` — end-to-end, layout, orchestration, trial
lifecycle, outputs (Mermaid).
- `README.md`, `docs/quick_start.md` — links to architecture doc.
- `examples/study_config_local_exec.yaml` — disable
`max_num_partial_prefills` (unsupported on V1; comment added).

## How tested
- [x] `ruff check .`
- [x] `pytest -v tests/`
- [x] Manual E2E (maintainer): not required (docs-only PR)

## Risks / limitations
- `.ai/context/current-work.md` is a point-in-time snapshot (open PRs
#13, #17, #21, #22); it will drift until refreshed after merges.
- Mermaid rendering depends on the viewer (GitHub, IDE); no runtime
behavior change except the example YAML default.

## Links
- (none — no issue closed)

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants