-
Notifications
You must be signed in to change notification settings - Fork 779
Feature/log and eval mode #413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
添加多个运行脚本以支持不同环境下的 SQL Agent 执行 新增 `run.sh`、`run_local.sh`、`run_qwen2.5_coder_local.sh` 和 `run_train_local.sh` 脚本, 用于配置并启动 Spider 数据集上的 SQL Agent 任务。 支持本地部署的 vLLM 兼容 API、训练及评估流程的一键执行。
|
Opened by mistake,sry!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds comprehensive logging infrastructure and evaluation mode support to the Spider SQL agent training system. The changes enable better monitoring of training progress, hardware utilization, and provide flexible configuration management through external JSON/YAML files.
Key Changes
- Added infrastructure for detailed logging including hardware snapshots, GPU/CPU monitoring, progress tracking, and sequence length analysis
- Introduced support for external configuration files (JSON/YAML) with CLI argument
--config-file - Created shell scripts for launching training and evaluation workflows with different model configurations
Reviewed changes
Copilot reviewed 12 out of 15 changed files in this pull request and generated 22 comments.
Show a summary per file
| File | Description |
|---|---|
| run_train_local.sh | New training launcher script with nohup support and environment logging |
| run_qwen2.5_coder_local.sh | Evaluation script for Qwen2.5-Coder-0.5B against local vLLM endpoint |
| run_local.sh | Evaluation script for Qwen3-4B against local vLLM endpoint |
| run.sh | Evaluation script using Aliyun DashScope API |
| examples/spider/train_sql_agent.py | Enhanced training script with config file loading, hardware monitoring, and logging infrastructure |
| examples/spider/sql_agent_eval.py | New comprehensive evaluation script with concurrency support and token tracking |
| examples/spider/doc/README.md | Chinese documentation describing the logging system changes |
| examples/spider/configs/singleGPU_qwen05b.json | Single GPU configuration for Qwen2.5-Coder-0.5B |
| examples/spider/configs/doubleGPU_qwen05b.json | Dual GPU configuration for Qwen2.5-Coder-1.5B |
| agentlightning/verl/trainer.py | Enhanced trainer with progress logging, sequence metrics, and validation summaries |
| .gitignore | Added specific log directory entries |
| result/qwen2.5-coder-1.5b-insruct-dashscopeApi.txt | Evaluation results for 1.5B model |
| result/qwen2.5-coder-0.5b.txt | Evaluation results for 0.5B model |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # export WANDB_DISABLED=true | ||
|
|
||
| # 配置路径(直接在此处硬编码,修改更方便) | ||
| CONFIG_PATH="/home/lthpc/student/LiTengfei/project/myfork/agent-lightning/examples/spider/configs/singleGPU_qwen05b.json" |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded absolute path with user-specific directory makes this configuration non-portable. Consider using a relative path or an environment variable that can be set per user/environment.
| CONFIG_PATH="/home/lthpc/student/LiTengfei/project/myfork/agent-lightning/examples/spider/configs/singleGPU_qwen05b.json" | |
| CONFIG_PATH="$ROOT/examples/spider/configs/singleGPU_qwen05b.json" |
| } | ||
| }, | ||
| "model": { | ||
| "path": "/home/lthpc/student/LiTengfei/LLaMA-Factory/models/Qwen2.5-Coder-1.5B-Instruct", |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded absolute path containing user-specific directory makes this configuration non-portable. Consider using a relative path or environment variable substitution to work across different development environments.
| "path": "/home/lthpc/student/LiTengfei/LLaMA-Factory/models/Qwen2.5-Coder-1.5B-Instruct", | |
| "path": "${MODEL_DIR}/Qwen2.5-Coder-1.5B-Instruct", |
| } | ||
| }, | ||
| "model": { | ||
| "path": "/home/lthpc/student/LiTengfei/LLaMA-Factory/models/Qwen2.5-Coder-1.5B-Instruct", |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filename is 'doubleGPU_qwen05b.json' suggesting a 0.5B model, but the model path points to 'Qwen2.5-Coder-1.5B-Instruct' which is a 1.5B model. This inconsistency between the filename and the actual model configuration could lead to confusion.
| } | ||
| }, | ||
| "model": { | ||
| "path": "/home/lthpc/student/LiTengfei/LLaMA-Factory/models/Qwen2.5-Coder-0.5B-Instruct", |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded absolute path containing user-specific directory makes this configuration non-portable. Consider using a relative path or environment variable substitution to work across different development environments.
| "path": "/home/lthpc/student/LiTengfei/LLaMA-Factory/models/Qwen2.5-Coder-0.5B-Instruct", | |
| "path": "${MODEL_PATH}", |
| "total_epochs": 2, | ||
| "log_interval": 1, | ||
| "save_freq": 500, | ||
| "default_local_dir": "/home/lthpc/student/LiTengfei/project/myfork/agent-lightning/examples/spider/ckpt_1.5b_grpo_config2" |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded absolute path containing user-specific directory makes this configuration non-portable. Consider using a relative path or environment variable substitution to work across different development environments.
| props = torch.cuda.get_device_properties(idx) | ||
| lines.append( | ||
| f"gpu_{idx}: name={props.name}, total_mem_gb={round(props.total_memory/1e9,2)}, " | ||
| f"capability={props.major}.{props.minor}" |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in the word 'unreasoable' - it should be 'unreasonable'.
| lock = threading.Lock() | ||
| throttle_lock = threading.Lock() | ||
| last_call_ts = [0.0] | ||
| min_interval = 0.5 # seconds between starting any two requests to respect rate limits |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magic number 0.5 for minimum interval between requests lacks explanation. Consider adding a comment explaining this rate limit value or making it configurable, especially since different API providers may have different rate limits.
| self._progress_log( | ||
| f"[{ts_val}] [val] step {self.global_steps}/{self.total_training_steps} | {key}={val}" | ||
| ) |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _progress_log method is called but is never defined. This will result in an AttributeError at runtime. The method needs to be defined in the class, likely writing to a progress log file configured via config.trainer.progress_log_file.
|
|
||
| cpu_percent = str(psutil.cpu_percent(interval=None)) | ||
| mem_percent = str(psutil.virtual_memory().percent) | ||
| except Exception: |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
| except Exception: | |
| except Exception: | |
| # Ignore all exceptions: missing CPU/memory stats are non-fatal and will be recorded as 'na' |
| try: | ||
| load_avg_vals = os.getloadavg() | ||
| load_avg = ",".join(str(round(v, 2)) for v in load_avg_vals) | ||
| except Exception: |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
No description provided.