Question: Any way to reduce memory for spider example #389

@hkvision

Hi, I'm new to agent-lightning and am trying to run the RL example: https://github.com/microsoft/agent-lightning/tree/v0.2.2/examples/spider

The README says this example requires at least one 40 GB GPU, but I only have a single 24 GB GPU. Are there any configuration options that would make it less memory-intensive?
I have tried setting "gpu_memory_utilization": 0.4 and reducing some batch sizes in the config from 8 to 2, but I still get an OOM error:

File "/home/arda/miniforge3/envs/kai-agentic/lib/python3.12/site-packages/torch/optim/adam.py", line 181, in _init_group
    state["exp_avg_sq"] = torch.zeros_like(
                          ^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 180.00 MiB. GPU 0 has a total capacity of 23.52 GiB of which 117.31 MiB is free. Including non-PyTorch memory, this process has 23.25 GiB memory in use. Of the allocated memory 30.63 GiB is allocated by PyTorch, with 231.19 MiB allocated in private pools (e.g., CUDA Graphs), and 64.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
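
For context, the failure happens while Adam allocates its per-parameter state (`exp_avg_sq`), so the training side rather than only the inference engine appears to be what runs out of memory. Below is a rough back-of-envelope sketch of how weights, gradients, and Adam's two moment buffers add up. It is not taken from agent-lightning: the function name, the default byte widths (fp32 everywhere), and the 1.5B parameter count are illustrative assumptions, and activations, the rollout engine's share of the GPU, and framework overhead are ignored.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimate_training_memory_gib(num_params, weight_bytes=4, grad_bytes=4,
                                 adam_state_bytes=8):
    """Rough memory estimate for full fine-tuning with Adam.

    Assumptions (hypothetical defaults, adjust to the real setup):
      - weights and gradients are stored in fp32 (4 bytes each per parameter)
      - Adam keeps two fp32 buffers per parameter (exp_avg and exp_avg_sq),
        i.e. 8 bytes per parameter
    Activations and any inference-engine memory are NOT included.
    """
    weights = num_params * weight_bytes / GIB
    grads = num_params * grad_bytes / GIB
    optimizer = num_params * adam_state_bytes / GIB
    return {
        "weights": weights,
        "grads": grads,
        "optimizer": optimizer,
        "total": weights + grads + optimizer,
    }


# Illustrative parameter count only; substitute the model actually being trained.
estimate = estimate_training_memory_gib(1_500_000_000)
for name, gib in estimate.items():
    print(f"{name}: {gib:.2f} GiB")
```

If the estimate for the actual model is already close to the card's 23.52 GiB before activations are counted, lowering batch sizes alone is unlikely to be enough, which would match what I am seeing.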

Also, the run produces so much log output that it is hard to find the key messages. Is there any way to hide useless or duplicate logs?

Image

Thanks so much for the help in advance!
