Hi, I'm new to agent-lightning and am trying to run the RL example: https://github.com/microsoft/agent-lightning/tree/v0.2.2/examples/spider
The README says this example requires at least one 40 GB GPU, but I only have a single 24 GB GPU. Are there any configuration options that would make it less memory-intensive?
I have tried setting "gpu_memory_utilization": 0.4 and reducing some batch sizes in the config from 8 to 2, but I still get the following OOM error:
File "/home/arda/miniforge3/envs/kai-agentic/lib/python3.12/site-packages/torch/optim/adam.py", line 181, in _init_group
state["exp_avg_sq"] = torch.zeros_like(
^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 180.00 MiB. GPU 0 has a total capacity of 23.52 GiB of which 117.31 MiB is free. Including non-PyTorch memory, this process has 23.25 GiB memory in use. Of the allocated memory 30.63 GiB is allocated by PyTorch, with 231.19 MiB allocated in private pools (e.g., CUDA Graphs), and 64.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
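For reference, here is a minimal sketch of how the allocator setting suggested in the error message could be applied from Python. The helper name `enable_expandable_segments` is hypothetical and not part of agent-lightning. It assumes the variable is set before CUDA is first initialized, and that any worker processes inherit the environment, which may not hold for every launcher. Since the message reports only 64.51 MiB reserved but unallocated, fragmentation is probably not the main cause here, so this setting alone may not resolve the OOM.

```python
import os


def enable_expandable_segments(env=None):
    """Add expandable_segments:True to PYTORCH_CUDA_ALLOC_CONF.

    Hypothetical helper: keeps any other allocator options already set and
    replaces an existing expandable_segments entry. Call it before CUDA is
    first used, since the allocator config is read at initialization.
    """
    env = os.environ if env is None else env
    current = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    # Options are comma-separated "key:value" pairs.
    options = [
        opt
        for opt in current.split(",")
        if opt and not opt.startswith("expandable_segments:")
    ]
    options.append("expandable_segments:True")
    env["PYTORCH_CUDA_ALLOC_CONF"] = ",".join(options)
    return env["PYTORCH_CUDA_ALLOC_CONF"]


if __name__ == "__main__":
    # Run this before importing torch or starting the training entry point.
    print(enable_expandable_segments())
```

Exporting the same variable in the shell before launching the training script should have the same effect and is likely simpler when the trainer spawns separate worker processes.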
Also, the run produces so much log output that it is hard to find the key messages. Is there any way to hide the useless or duplicate logs?
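To illustrate the kind of filtering meant here, the sketch below uses only Python's standard `logging` module. It is not an agent-lightning feature: the names `DedupFilter` and `quiet_logs` are hypothetical, and the logger names `"vllm"` and `"ray"` are assumptions about which libraries produce the noise. It also only affects the process it runs in, so output printed by separate worker processes would not be filtered.

```python
import logging


class DedupFilter(logging.Filter):
    """Drop records whose logger, level, and message were already seen."""

    def __init__(self):
        super().__init__()
        # Grows with the number of distinct messages; fine for a sketch.
        self._seen = set()

    def filter(self, record):
        key = (record.name, record.levelno, record.getMessage())
        if key in self._seen:
            return False  # exact repeat: suppress it
        self._seen.add(key)
        return True


def quiet_logs(noisy_loggers=("vllm", "ray"), level=logging.WARNING):
    """Raise the level of assumed-noisy loggers and dedupe root handler output."""
    for name in noisy_loggers:
        logging.getLogger(name).setLevel(level)
    dedup = DedupFilter()
    # A filter on a handler also applies to records propagated from child loggers.
    for handler in logging.getLogger().handlers:
        handler.addFilter(dedup)
    return dedup


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    quiet_logs()
    log = logging.getLogger("demo")
    log.info("step finished")
    log.info("step finished")  # suppressed as a duplicate
```

Calling `quiet_logs()` after the logging handlers are configured and before training starts would be the intended usage. Messages that differ only by a timestamp or step counter inside the text would still pass through, since the comparison is on the exact formatted message.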
Thanks so much for the help in advance!