feat(rl): PPO training pipeline with smoke-train verification#6

Merged
thanhndv212 merged 1 commit into master from feat/rl-training on Jun 21, 2026
@thanhndv212
M2b — RL Training Pipeline (COMPLETES M2)

New: training/ package

| File | Purpose |
| --- | --- |
| train_ppo.py | SB3 PPO entry point (CLI, SubprocVecEnv, VecNormalize) |
| eval_policy.py | Deterministic evaluation → pos_rmse, success_rate, crash_rate, mean_reward |
| configs/ppo_hover.yaml | YAML hyperparameters |
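
For readers who want a concrete picture of the four metrics that eval_policy.py reports, here is a minimal sketch of how they could be aggregated from evaluation rollouts. It is not the code in this PR: the `Episode` record, the `summarize` helper, the `success_tol` threshold, and the success criterion (no crash, final position error within tolerance) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Episode:
    """One evaluation rollout (hypothetical record layout, not from the PR)."""
    pos_errors: list[float]  # per-step position error norm, in metres (non-empty)
    total_reward: float      # undiscounted episode return
    crashed: bool            # True if the episode ended in a crash


def summarize(episodes: list[Episode], success_tol: float = 0.1) -> dict[str, float]:
    """Aggregate rollouts into pos_rmse, success_rate, crash_rate, mean_reward."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)

    # RMSE pooled over every step of every episode.
    squared = [e * e for ep in episodes for e in ep.pos_errors]
    pos_rmse = math.sqrt(sum(squared) / len(squared))

    # Assumed success criterion: no crash and final error within tolerance.
    successes = sum(
        1 for ep in episodes
        if not ep.crashed and ep.pos_errors[-1] <= success_tol
    )
    crashes = sum(1 for ep in episodes if ep.crashed)

    return {
        "pos_rmse": pos_rmse,
        "success_rate": successes / n,
        "crash_rate": crashes / n,
        "mean_reward": sum(ep.total_reward for ep in episodes) / n,
    }


if __name__ == "__main__":
    demo = [
        Episode(pos_errors=[0.3, 0.0], total_reward=10.0, crashed=False),
        Episode(pos_errors=[0.4, 0.5], total_reward=-2.0, crashed=True),
    ]
    print(summarize(demo))
```

Pooling the squared errors across all steps weights long episodes more heavily than short ones; averaging per-episode RMSE values instead would be an equally defensible choice.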

New examples

  • examples/07_rl_hover.py — end-to-end: train PPO 50k steps, eval, print RMSE
  • examples/08_rl_vs_pid.py — head-to-head comparison vs cascaded PID on circular trajectory

pyproject.toml — optional RL extras

[project.optional-dependencies]
rl  = ["torch>=2.2", "stable-baselines3>=2.3", "gymnasium>=0.29", "tensorboard>=2.15", "pyyaml>=6.0"]
rl-dev = ["drones-sim[rl]", "wandb", "optuna>=3.5", "moviepy"]

Smoke-train verified ✅

  • 50,000 timesteps in 25 s at ~2,000 fps (CPU, 2 envs)
  • Model checkpoint saved to training/checkpoints/final.zip
  • python -m training.eval_policy --path training/checkpoints/final.zip loads the checkpoint and evaluates it
  • Pipeline: train → save → load → eval — verified end-to-end
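
The smoke-train figures above are mutually consistent, which the short check below makes explicit. It assumes that the reported fps counts environment steps summed over all parallel envs (which is how SB3's logger is understood to compute it); the `throughput` helper is hypothetical and exists only for this calculation.

```python
def throughput(timesteps: int, wall_clock_s: float, n_envs: int) -> tuple[float, float]:
    """Return (total steps per second, steps per second per env)."""
    total = timesteps / wall_clock_s
    return total, total / n_envs


if __name__ == "__main__":
    total, per_env = throughput(timesteps=50_000, wall_clock_s=25, n_envs=2)
    print(total, per_env)  # 2000.0 1000.0
```

Under that assumption each of the two envs advances at roughly 1,000 steps per second.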

Tests — 2 new in tests/test_rl_training.py

  • Smoke-train 1000 steps + save/reload/predict
  • Self-contained eval metrics test (tiny model → metrics dict)
  • All guarded by @pytest.mark.skipif(not _HAS_SB3)

.gitignore

Added training/checkpoints/ and tb/ (TensorBoard logs)

============================== 92 passed ==============================

M1 + M2 — COMPLETE 🎉

6 PRs, 35+ files, 2,000+ lines of code, 92 tests passing:

  1. ✅ Quaternion dynamics plant (no gimbal lock)
  2. ✅ CI workflow (ruff + pytest, Python 3.10–3.12)
  3. ✅ Disturbance framework (wind, ground effect, payload drop, motor failure)
  4. ✅ Telemetry logging (CSV + JSON Lines)
  5. ✅ RL Gymnasium env (modular action/obs/reward/task)
  6. ✅ RL training pipeline (PPO smoke-trained, verified)

Adds a complete RL training stack built on Stable-Baselines3 + PyTorch:

training/
  train_ppo.py          — SB3 PPO CLI entry point (SubprocVecEnv + VecNormalize)
  eval_policy.py        — load checkpoint → deterministic evaluation metrics
  configs/ppo_hover.yaml — YAML hyperparameter config

examples/
  07_rl_hover.py        — end-to-end: train PPO 50k steps, eval, print RMSE
  08_rl_vs_pid.py       — head-to-head comparison vs cascaded PID on circular trajectory

pyproject.toml
  [project.optional-dependencies]
  rl     = [torch, stable-baselines3, gymnasium, tensorboard, pyyaml]
  rl-dev = [..., wandb, optuna, moviepy]

.gitignore: added training/checkpoints/ and tb/ (TensorBoard logs)

Smoke-train results:
  50,000 timesteps, 2 envs, ~2000 fps, 25 s wall-clock
  Model saves + loads successfully via eval_policy

Tests (tests/test_rl_training.py):
  Smoke-train 1000 steps + save/reload/predict
  Self-contained eval_policy test (tiny model → metrics dict)
  All guarded by @pytest.mark.skipif(not _HAS_SB3)

Full suite: 92 tests passed (ruff clean).
@thanhndv212 thanhndv212 merged commit 0c88ec7 into master Jun 21, 2026
4 checks passed
@thanhndv212 thanhndv212 deleted the feat/rl-training branch June 21, 2026 11:58