Merged
52 changes: 52 additions & 0 deletions .ai/context/architecture.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# Architecture — execution flow

**User diagrams:** [`docs/architecture.md`](../../docs/architecture.md).

## 1. Configuration loading

- YAML study file (`examples/study_config_local_exec.yaml` reference).
- `StudyConfig.from_file()` in `core/config.py` — parse, validate objectives, storage, parameters.
- CLI `optimize` (`cli/main.py`) copies config beside SQLite when `storage_file` is set.
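
The copy-beside-storage step can be sketched as follows. This is a minimal illustration, not the repository's code: the helper name `copy_config_beside_storage` and its exact behavior are assumptions.

```python
import shutil
from pathlib import Path
from typing import Optional


def copy_config_beside_storage(
    config_path: Path, storage_file: Optional[Path]
) -> Optional[Path]:
    """Copy the study YAML next to the SQLite file and return the copy's path.

    Hypothetical helper: the name and behavior are assumptions for illustration.
    """
    if storage_file is None:
        # No SQLite file configured (for example PostgreSQL storage): nothing to copy.
        return None
    target_dir = storage_file.resolve().parent
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / config_path.name
    if target.resolve() != config_path.resolve():
        # copy2 also preserves timestamps, which helps when auditing old studies.
        shutil.copy2(config_path, target)
    return target
```

Keeping the YAML next to the database means a study can be inspected later without guessing which config produced it.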

## 2. Study setup

- `StudyController.create_from_config(LocalExecutionBackend, config)`:
  - Optional PostgreSQL (`storage/postgres_utils.py`).
  - `get_storage(config)` → SQLite / PostgreSQL (`storage/utils.py`).
  - Optuna `Study` + sampler (TPE, Grid, NSGA-II, …).
  - `CentralizedLogger` if logging block present.
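
As a rough sketch of the SQLite / PostgreSQL choice, the function below turns the two storage settings into a SQLAlchemy-style URL of the kind Optuna accepts. `resolve_storage_url` is a hypothetical name, and the "exactly one of the two" rule is an assumption rather than confirmed `get_storage` behavior.

```python
from pathlib import Path
from typing import Optional


def resolve_storage_url(
    storage_file: Optional[str], database_url: Optional[str]
) -> str:
    """Return a storage URL for Optuna (hypothetical helper, assumed rules)."""
    if storage_file and database_url:
        raise ValueError("Set only one of storage_file or database_url")
    if database_url:
        # PostgreSQL (or any other SQLAlchemy URL) is passed through unchanged.
        return database_url
    if storage_file:
        # An absolute POSIX path yields the four-slash form: sqlite:////abs/path.db
        return f"sqlite:///{Path(storage_file).expanduser().resolve()}"
    raise ValueError("One of storage_file or database_url is required")
```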

## 3. Study loop

- Baselines: `_run_baseline_trials` → enqueue + run with default/static params.
- Optimization: `study.ask()` → `TrialConfig` → `backend.submit_trial()` → poll → `study.tell()`.
- Failures: error classification → trial user attrs.
- Optional `optimization.log_metrics` → extra user attrs (PR #22).
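
The failure path above (classify, then record on the trial) might look like the sketch below. The category names and the `classify_failure` helper are assumptions, not the project's actual classification.

```python
import subprocess
from typing import Dict


def classify_failure(exc: BaseException) -> Dict[str, str]:
    """Map an exception to string attributes suitable for trial user attrs.

    Hypothetical helper: categories and attribute keys are illustrative.
    """
    if isinstance(exc, (TimeoutError, subprocess.TimeoutExpired)):
        category = "timeout"
    elif isinstance(exc, MemoryError):
        category = "out_of_memory"
    elif isinstance(exc, subprocess.CalledProcessError):
        category = "process_exit"
    else:
        category = "unknown"
    # Truncate the message so a long traceback does not bloat the storage row.
    return {"failure_category": category, "failure_message": str(exc)[:200]}
```

With Optuna, each key/value pair would typically be stored through `trial.set_user_attr(key, value)` before calling `study.tell(trial, state=optuna.trial.TrialState.FAIL)`, which makes the failure reason visible in the dashboard.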

## 4. Backend (supported path)

**`LocalExecutionBackend`** (`execution/backends.py`): thread pool, `LocalTrialController`, `poll_trials`, `cleanup_all_trials`.

Legacy: `RayExecutionBackend` exists for upstream compatibility; not the fork focus.
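
A thread-pool backend with a submit / poll / cleanup surface can be sketched as below. The class `LocalBackendSketch` is hypothetical and only mirrors the method names listed above; signatures and internals of the real `LocalExecutionBackend` are not assumed.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


class LocalBackendSketch:
    """Illustrative thread-pool backend (not the repository's implementation)."""

    def __init__(self, max_concurrent_trials: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent_trials)
        self._running: Dict[int, Future] = {}

    def submit_trial(self, trial_id: int, run: Callable[[], float]) -> None:
        # Each trial runs in a worker thread; the future is tracked by trial id.
        self._running[trial_id] = self._pool.submit(run)

    def poll_trials(self) -> List[Tuple[int, Future]]:
        """Return finished (trial_id, future) pairs and stop tracking them."""
        done = [(tid, fut) for tid, fut in self._running.items() if fut.done()]
        for tid, _ in done:
            del self._running[tid]
        return done

    def cleanup_all_trials(self) -> None:
        for fut in self._running.values():
            # cancel() only succeeds for trials that have not started yet;
            # running trials need cooperative cancellation in the trial itself.
            fut.cancel()
        self._running.clear()
        self._pool.shutdown(wait=True)
```

Polling keeps the study loop single-threaded: results are pulled on the caller's schedule and can be reported to Optuna from one place.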

## 5. Single trial

`BaseTrialController.run_trial()` (`execution/trial_controller.py`):

1. Validate imports (vllm, guidellm, optuna).
2. `GuideLLMBenchmark` from `benchmarks/providers.py`.
3. `_start_vllm_server()` → `_wait_for_server_ready()`.
4. State machine: `WAITING_FOR_VLLM` → `RUNNING_BENCHMARK`.
5. Metrics → objectives; `cleanup_resources()` on exit/cancel.
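
Steps 3 to 5 can be sketched as a readiness poll followed by a state transition, with cleanup in a `finally` block. Only the two state names come from the list above; `TrialPhase`, `wait_for_server_ready`, and `run_trial_sketch` are hypothetical names, and the readiness check is passed in as a callable instead of a real HTTP health probe.

```python
import enum
import time
from typing import Callable, List, Tuple


class TrialPhase(enum.Enum):
    WAITING_FOR_VLLM = "waiting_for_vllm"
    RUNNING_BENCHMARK = "running_benchmark"


def wait_for_server_ready(
    is_ready: Callable[[], bool], timeout_s: float, interval_s: float = 0.5
) -> bool:
    """Poll is_ready() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval_s)
    return False


def run_trial_sketch(
    is_ready: Callable[[], bool],
    run_benchmark: Callable[[], float],
    cleanup: Callable[[], None],
    timeout_s: float = 5.0,
) -> Tuple[float, List[TrialPhase]]:
    """Return the benchmark value and the phases visited (illustrative only)."""
    phases = [TrialPhase.WAITING_FOR_VLLM]
    try:
        if not wait_for_server_ready(is_ready, timeout_s, interval_s=0.01):
            raise TimeoutError("server did not become ready in time")
        phases.append(TrialPhase.RUNNING_BENCHMARK)
        return run_benchmark(), phases
    finally:
        # Runs on success, failure, and timeout alike.
        cleanup()
```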

## 6. Storage & logs

- Optuna: `study.storage_file` or `study.database_url`.
- Logs: `logging/manager.py` (file and/or DB).
- Dashboard: `optuna_dashboard/start_optuna_dashboard.sh`.

## 7. Cleanup

- Per trial: kill vLLM + benchmark process group.
- Study end / interrupt: `backend.cleanup_all_trials()`, `shutdown()`.
- Known gap: orphan vLLM if parent killed abruptly (issue #2).
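
Per-trial process-group cleanup can be sketched as below (POSIX only). The helper names are hypothetical. The child is started in its own session so that signalling its group cannot reach the parent; the sketch does not address the known gap above, since cleanup code never runs when the parent itself is killed with SIGKILL.

```python
import os
import signal
import subprocess
from typing import List


def start_in_own_group(cmd: List[str]) -> subprocess.Popen:
    # start_new_session=True calls setsid() in the child, so the child leads
    # a new process group that is separate from the parent's.
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc: subprocess.Popen, grace_s: float = 5.0) -> None:
    """SIGTERM the child's process group, then SIGKILL after a grace period."""
    if proc.poll() is not None:
        return  # already exited
    try:
        pgid = os.getpgid(proc.pid)
    except ProcessLookupError:
        return
    shares_our_group = pgid == os.getpgrp()

    def send(sig: int) -> None:
        try:
            if shares_our_group:
                # Never signal our own group: fall back to the single process.
                proc.send_signal(sig)
            else:
                os.killpg(pgid, sig)
        except ProcessLookupError:
            pass  # group already gone

    send(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        send(signal.SIGKILL)
        proc.wait()
```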
41 changes: 41 additions & 0 deletions .ai/context/current-work.md
@@ -0,0 +1,41 @@
# Current work

_Last updated from local `gh` / git — refresh before large changes._

## Open pull requests (InseeFrLab)

| PR | Branch | Objective | Status | Next step |
|----|--------|-----------|--------|-----------|
| [#22](https://github.com/InseeFrLab/auto-tuning-vllm/pull/22) | `FEAT/optuna-user-attrs-log-metrics` | `optimization.log_metrics` → Optuna user attrs for dashboard | OPEN | Review + merge; ensure docs/example match `StudyConfig` validation |
| [#21](https://github.com/InseeFrLab/auto-tuning-vllm/pull/21) | `fix/exclude-baseline-trials-budget` | Baselines must not increment `completed_trials` / consume `n_trials` | OPEN | Merge; run `pytest tests/core/test_study_controller.py` |
| [#17](https://github.com/InseeFrLab/auto-tuning-vllm/pull/17) | `fix/guidellm-cli-preflight` | GuideLLM CLI preflight + pin `vllm<=0.19` | OPEN | Resolve overlap with issue #19 / current `pyproject` vllm pin |
| [#13](https://github.com/InseeFrLab/auto-tuning-vllm/pull/13) | `fix/local-backend-cleanup` | Cooperative cancel + cleanup on local backend | OPEN | Merge after manual interrupt test |
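
The budget rule targeted by PR #21 can be illustrated with a small counter. This is a sketch of the intended behavior as described in the table, under the assumption that baselines are tracked separately; `TrialBudget` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class TrialBudget:
    """Tracks optimization trials against n_trials (illustrative only)."""

    n_trials: int
    completed_trials: int = 0
    baseline_trials: int = 0

    def record(self, is_baseline: bool) -> None:
        if is_baseline:
            # Baselines are counted separately and never consume the budget.
            self.baseline_trials += 1
        else:
            self.completed_trials += 1

    @property
    def exhausted(self) -> bool:
        return self.completed_trials >= self.n_trials
```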

## Remote branches (not all have open PRs)

| Branch | Notes |
|--------|--------|
| `origin/FEAT/custom-metrics` | Merged as #18 on main |
| `origin/FEAT/grid-cardinality-auto-switch` | Merged as #7 |
| `origin/FEAT/ray-optional` | Legacy: Ray optional extra (merged) |
| `origin/add-optuna-dashboard-example` | Dashboard launcher (#14 merged) |
| `origin/add-startup-timeout-baseline-run` | Startup timeout for baselines — **verify if merged or stale** |
| `origin/ci-setup` | CI workflow (#8) |
| `origin/renovate/configure` | Dependency bot config |

## README roadmap (main)

| Item | Status | Next step |
|------|--------|-----------|
| Comprehensive test suite | In progress (small `tests/` tree) | Add controller/backend tests per PR #21 pattern |
| CI runs tests strictly | Partial | Remove `pytest ... \|\| true` in `ci.yml` when suite is stable |
| Dependency pinning / hygiene | Open | Align `pyproject.toml` with supported vLLM/GuideLLM matrix |
| CLI validation / error messages | Open | Extend `StudyConfig` errors + Typer messages |
| Speculative decoding params | Future | Design parameter module + example YAML |
| Extra benchmark providers | Future | Implement `BenchmarkProvider` subclass |

## Maintainer TODO (fill if stale)

- **Active local branch:** `FEAT/optuna-user-attrs-log-metrics` — confirm whether uncommitted edits on `config.py` / `study_controller.py` belong in PR #22.
- **Production study configs:** _Add paths or naming convention used internally._
- **Target vLLM version for production:** _e.g. 0.19 vs 0.20+ — drives issue #19 resolution._
5 changes: 5 additions & 0 deletions .ai/context/diagrams.md
@@ -0,0 +1,5 @@
# Diagrams (agents)

User-facing Mermaid diagrams live in **[`docs/architecture.md`](../../docs/architecture.md)**.

When changing structure, update that file (see [`.ai/skills/architecture-diagrams.md`](../skills/architecture-diagrams.md)).
37 changes: 37 additions & 0 deletions .ai/context/external-links.md
@@ -0,0 +1,37 @@
# External links

## Repositories

| Resource | URL |
|----------|-----|
| Fork | https://github.com/InseeFrLab/auto-tuning-vllm |
| Upstream | https://github.com/openshift-psap/auto-tuning-vllm |
| GuideLLM | https://github.com/neuralmagic/guidellm |
| vLLM | https://github.com/vllm-project/vllm |

## Documentation

| Topic | URL |
|-------|-----|
| vLLM | https://docs.vllm.ai/ |
| Optuna | https://optuna.readthedocs.io/ |
| Optuna Dashboard | https://github.com/optuna/optuna-dashboard |

## In-repo

| Doc | Path |
|-----|------|
| Quick start | `docs/quick_start.md` |
| Configuration | `docs/configuration.md` |
| Architecture (diagrams) | `docs/architecture.md` |

## Legacy (Ray — not agent focus)

| Doc | Path |
|-----|------|
| Ray cluster | `docs/ray_cluster_setup.md` |
| Ray auto-start | `docs/ray_auto_start.md` |

## GitHub (fork)

Issues: https://github.com/InseeFrLab/auto-tuning-vllm/issues — see `current-work.md` for open PRs.
43 changes: 43 additions & 0 deletions .ai/context/history.md
@@ -0,0 +1,43 @@
# History — decisions to preserve

## Execution model

| Decision | Reference |
|----------|-----------|
| **Local backend is the product path** | `LocalExecutionBackend`; fork README |
| **Do not kill parent process group** on vLLM cleanup | upstream PR #92 |
| Ray backend kept as **legacy / optional** extra | `db1e9ab`; issue #3 — not primary development |

## Optuna / study

| Decision | Reference |
|----------|-----------|
| Baselines visible in Optuna dashboard | upstream PR #111 |
| Failed trial attrs for sampler | #93, #97 |
| Constraint sampling | #101 |
| Grid cardinality auto-switch | fork PR #7 |
| Custom metric expressions | fork PR #18 |
| `max_concurrent_trials` naming | upstream #122, #125 |

## Benchmarking

| Decision | Reference |
|----------|-----------|
| GuideLLM as default provider | `benchmarks/providers.py` |
| Process-group benchmark terminate | `BenchmarkProvider` |

## Config / vLLM

| Decision | Reference |
|----------|-----------|
| Versioned defaults in `schemas/vllm_defaults/` | `version_manager.py` |
| Config validation in Python (no separate JSON schema) | upstream #110 |

## Tooling

| Decision | Reference |
|----------|-----------|
| CI: Ruff + pytest matrix | fork PR #8 |
| Optuna Dashboard script | fork PR #14 |

Upstream: [openshift-psap/auto-tuning-vllm](https://github.com/openshift-psap/auto-tuning-vllm). Fork emphasizes **local execution**, tests, and dependency control.
22 changes: 22 additions & 0 deletions .ai/context/known-issues.md
@@ -0,0 +1,22 @@
# Known issues

Update this file when merging fixes (no separate triage skill).

| Title | Status | Link | Component | Next action |
|-------|--------|------|-----------|-------------|
| GuideLLM + vLLM ≥ 0.20 | open | [#19](https://github.com/InseeFrLab/auto-tuning-vllm/issues/19) | `providers.py`, deps | Merge #17 or document pins |
| GuideLLM + transformers ≥ 5 | open | [#15](https://github.com/InseeFrLab/auto-tuning-vllm/issues/15) | GuideLLM | Reproduce; track upstream |
| Orphan vLLM on parent stop | open | [#2](https://github.com/InseeFrLab/auto-tuning-vllm/issues/2) | `trial_controller.py` | Merge #13 |
| Local backend cleanup | fix pending | [#13](https://github.com/InseeFrLab/auto-tuning-vllm/pull/13) | `backends.py` | Merge PR |
| Baselines consume `n_trials` | fix pending | [#21](https://github.com/InseeFrLab/auto-tuning-vllm/pull/21) | `study_controller.py` | Merge PR |
| CI pytest non-blocking | open | `ci.yml` | CI | Remove `\|\| true` when stable |
| Basic usage tests | open | [#4](https://github.com/InseeFrLab/auto-tuning-vllm/issues/4) | `tests/` | Expand pytest |
| Ray removal / deprecation | open | [#3](https://github.com/InseeFrLab/auto-tuning-vllm/issues/3) | `backends.py` | Legacy only; local path default |

## Code TODOs

| File | Note |
|------|------|
| `cli/main.py` | Sync log streaming |
| `trial_controller.py` | Remove debug health logging |
| `config.py` | Split int/float range parameter types |
26 changes: 26 additions & 0 deletions .ai/context/repo-map.md
@@ -0,0 +1,26 @@
# Repository map

| Path | Role |
|------|------|
| `auto_tune_vllm/` | Python package |
| `auto_tune_vllm/cli/main.py` | Typer CLI: `optimize`, `resume`, `logs` |
| `auto_tune_vllm/core/config.py` | `StudyConfig.from_file()` |
| `auto_tune_vllm/core/study_controller.py` | Optuna loop, baselines, concurrency |
| `auto_tune_vllm/core/trial.py` | `TrialConfig`, `TrialResult` |
| `auto_tune_vllm/core/parameters.py` | Search-space types |
| `auto_tune_vllm/core/storage/` | Optuna storage, PostgreSQL helpers |
| `auto_tune_vllm/execution/backends.py` | `LocalExecutionBackend` (+ legacy Ray class) |
| `auto_tune_vllm/execution/trial_controller.py` | vLLM + GuideLLM + cleanup |
| `auto_tune_vllm/benchmarks/` | `GuideLLMBenchmark`, `BenchmarkConfig` |
| `auto_tune_vllm/logging/` | Centralized trial logs |
| `auto_tune_vllm/utils/` | Grid cardinality, vLLM CLI, versioned defaults |
| `auto_tune_vllm/schemas/vllm_defaults/` | Per-version default YAML |
| `docs/` | `quick_start.md`, `architecture.md`, `configuration.md` |
| `examples/` | Study YAMLs and demos |
| `tests/` | Pytest (`core/`, `execution/`) |
| `optuna_dashboard/` | Dashboard launcher + sample DB |
| `.github/workflows/ci.yml` | Ruff, pytest matrix |
| `pyproject.toml` | Dependencies and tooling |
| `README.md` | Install and usage |

**CLI:** `auto-tune-vllm` → `auto_tune_vllm.cli:main`
28 changes: 28 additions & 0 deletions .ai/skills/architecture-diagrams.md
@@ -0,0 +1,28 @@
# Skill: Architecture diagrams

## User doc (source of truth)

**[`docs/architecture.md`](../../docs/architecture.md)** — Mermaid diagrams for contributors and users. Also linked from `README.md`.

Agent context: [`.ai/context/architecture.md`](../context/architecture.md) (prose only).

## When to update `docs/architecture.md`

| Change | Section |
|--------|---------|
| New package / module layout | Repository layout |
| Study or trial flow | End-to-end flow, Study orchestration, Single trial lifecycle |
| Storage / logs | Outputs per study |
| Import graph | Module dependencies |

## Rules

1. Use real module paths (`study_controller.py`).
2. Default path = **local** backend; Ray at most one sentence, no extra diagrams.
3. Mermaid only (GitHub renders natively).
4. PR: note “updated docs/architecture.md” when structure changes.

## Do not

- Duplicate full diagrams under `.ai/context/`.
- Run `auto-tune-vllm optimize` to validate diagrams.
24 changes: 24 additions & 0 deletions .ai/skills/docs-writer.md
@@ -0,0 +1,24 @@
# Skill: Docs writer

## Scope

| Audience | Files |
|----------|-------|
| Users | `README.md`, `docs/quick_start.md`, `docs/configuration.md` |
| Examples | `examples/*.yaml` |
| Agents | `.ai/context/*` |

## Rules

1. Runnable commands: `pip install -e .`, `ruff`, `pytest`, `auto-tune-vllm --help` — E2E optimize only in maintainer sections.
2. YAML keys match `StudyConfig` in `core/config.py`.
3. Link GitHub issues instead of long incident writeups.
4. Structural changes → update `docs/architecture.md` per `architecture-diagrams.md`.

## Ray

Legacy user docs live in `docs/ray_*.md`; do not expand Ray in agent context unless deprecating.

## Agents must not document

“Run optimize to verify” as an agent step — see `AGENTS.md`.
70 changes: 70 additions & 0 deletions .ai/skills/pr-reviewer.md
@@ -0,0 +1,70 @@
# Skill: PR reviewer

Review = **read the diff**, **reason about behavior**, optionally **lint + unit tests**.
**Never run the autotuner** (`auto-tune-vllm optimize`, `resume`, or any command that starts vLLM / GuideLLM / GPU work). Maintainers run end-to-end studies manually.

## Allowed commands (agents)

```bash
source venv/bin/activate
ruff check .
pytest -v tests/
# optional: basedpyright (if enabled locally)
```

## Review workflow

1. Read PR description and linked issues.
2. Walk changed files; trace call path from `cli/main.py` or `StudyController` when relevant.
3. Run `ruff check .` and `pytest -v tests/` if environment is available.
4. Record findings in the output format below.

## Config & CLI

- [ ] `StudyConfig.from_file()` — new fields validated; errors actionable.
- [ ] `examples/*.yaml` + `docs/configuration.md` aligned.
- [ ] Typer options in `cli/main.py` documented when added.

## Local execution path (primary)

- [ ] `LocalExecutionBackend` — submit/poll/cancel/cleanup semantics still coherent.
- [ ] `trial_controller.py` — vLLM + GuideLLM lifecycle, cancellation, `cleanup_resources()`.
- [ ] No regression for install **without** Ray (`pip install -e .` only).

## Optuna

- [ ] `study.ask()` / `study.tell()` paired; failures → `FAIL` + user attrs.
- [ ] Baseline vs optimization trial counting (`n_trials`, PR #21 context).
- [ ] Grid / sampler / multi-objective values consistent.

## Benchmarks & metrics

- [ ] `benchmarks/providers.py` — GuideLLM CLI args from `BenchmarkConfig`.
- [ ] Objective expressions match `ObjectiveConfig.valid_metrics_combined`.
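
One way a reviewer could reason about the expression check is sketched below: parse the expression and compare the names it uses against the allowed metric set. The function and the metric names in the usage note are hypothetical; how `ObjectiveConfig.valid_metrics_combined` is actually used is not assumed.

```python
import ast
from typing import Set


def unknown_metric_names(expression: str, valid_metrics: Set[str]) -> Set[str]:
    """Return names used in the expression that are not valid metrics.

    Hypothetical helper: it parses the expression but never evaluates it.
    """
    tree = ast.parse(expression, mode="eval")
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return used - valid_metrics
```

For example, with an assumed metric set `{"throughput", "latency_p95"}`, the expression `"throughput / latency_p95"` yields an empty set, while a misspelled metric name is returned so the error message can name it.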

## Tests & docs

- [ ] New behavior covered in `tests/` without mandatory GPU.
- [ ] User-facing docs updated when behavior or YAML changes.

## Legacy Ray (only if PR touches `RayExecutionBackend`)

- [ ] Optional import still works; no new hard dependency on `ray` in core install path.
- [ ] No Ray-specific review steps unless the diff is explicitly Ray-related.
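
The "optional import still works" check refers to a pattern like the sketch below, where `ray` is only imported when the Ray backend is requested. The helper names and the error wording are assumptions.

```python
import importlib.util
from types import ModuleType


def ray_available() -> bool:
    # find_spec checks importability without importing the package.
    return importlib.util.find_spec("ray") is not None


def require_ray() -> ModuleType:
    """Import ray lazily, with an actionable error if it is missing."""
    if not ray_available():
        raise ImportError(
            "The Ray backend was requested but 'ray' is not installed. "
            "Install the optional Ray extra or use the local backend."
        )
    import ray  # deferred so the core install path never needs it

    return ray
```

A diff that adds a module-level `import ray` anywhere on the local path would break this property, which is what the first checkbox guards against.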

## Output format

```markdown
### Blockers
- ...

### Questions
- ...

### Nits
- ...

### Checks run
- [ ] ruff
- [ ] pytest
```