[IJCV] Taming Data and Transformers for Audio Generation

🚀 Check Out Our Latest Work! 🎥🔊

Video-to-Audio and Audio-to-Video Generation
Discover how we bridge the gap between video and audio generative models!

This is the official GitHub repository of the paper Taming Data and Transformers for Audio Generation.

Taming Data and Transformers for Audio Generation
Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Vicente Ordonez
IJCV

Introduction

Generating ambient sounds is challenging because of data scarcity and often insufficient caption quality, which makes it difficult to apply large-scale generative models to the task. In this work, we tackle this problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. By using a compact audio representation and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of 83.2, a 3.2% improvement over the best available captioning model, at four times faster inference speed. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. Using AutoCap to caption clips from existing audio datasets, we demonstrate the benefits of data scaling with synthetic captions as well as model size scaling. Compared to state-of-the-art audio generators trained at similar size and data scale, GenAu obtains significant improvements of 4.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating higher-quality generated audio than previous works. Moreover, we propose an efficient and scalable pipeline for collecting audio datasets, enabling us to compile 57M ambient audio clips, forming AutoReCap-XL, the largest available audio-text dataset, at 90 times the scale of existing ones. For more details, please visit our project webpage.

Updates

2025.04.25: Released the GenAU-L-Full-HQ-Data model and added Gradio demos.
2024.10.24: Code released!
2024.06.28: Paper released!

Reproducibility and Comparison with GenAU

To facilitate comparison with GenAU, we provide a Google Drive link containing all AudioCaps test samples generated by the following GenAU models:

GenAU-L-Full-HQ-Data (1.25B parameters) trained with AutoReCap-XL filtered at a CLAP score threshold of 0.4 (20.7M samples)
GenAU-L-Autorecap (1.25B parameters) trained with AutoReCap (760k samples)
GenAU-S-Autorecap (493M parameters) trained with AutoReCap (760k samples)
GenAU-L-AC (1.25B parameters) trained only on AudioCaps
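The CLAP-score filtering mentioned for GenAU-L-Full-HQ-Data can be pictured with a minimal sketch. It assumes each clip already carries a precomputed audio-text CLAP similarity under a hypothetical `clap_score` key; the record layout and the function name `filter_by_clap` are illustrative and not taken from this repository. Whether the actual pipeline treats the threshold as inclusive is also an assumption.

```python
from typing import Any, Dict, Iterable, List

# Threshold quoted for the GenAU-L-Full-HQ-Data training set.
CLAP_THRESHOLD = 0.4


def filter_by_clap(
    records: Iterable[Dict[str, Any]],
    threshold: float = CLAP_THRESHOLD,
) -> List[Dict[str, Any]]:
    """Keep records whose precomputed CLAP similarity is at least `threshold`.

    Each record is assumed (hypothetically) to look like
    {"audio": "clip.wav", "caption": "...", "clap_score": 0.52}.
    Records that have no score are dropped.
    """
    kept = []
    for record in records:
        score = record.get("clap_score")
        if score is not None and score >= threshold:
            kept.append(record)
    return kept


if __name__ == "__main__":
    samples = [
        {"audio": "a.wav", "caption": "rain on a roof", "clap_score": 0.52},
        {"audio": "b.wav", "caption": "a dog barks", "clap_score": 0.31},
        {"audio": "c.wav", "caption": "unscored clip"},
    ]
    for record in filter_by_clap(samples):
        print(record["audio"], record["clap_score"])
```

Raising the threshold trades dataset size for tighter audio-caption agreement, which is the trade-off implied by the 20.7M-sample filtered subset of the 57M-clip dataset.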

Setup

Initialize a conda environment named genau by running:

conda env create -f environment.yaml
conda activate genau
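After activation, a quick sanity check can confirm that Python is running inside the expected environment. The sketch below relies on the `CONDA_DEFAULT_ENV` variable that conda sets when an environment is activated; the helper name `in_conda_env` is hypothetical and not part of this repository.

```python
import os
from typing import Mapping, Optional


def in_conda_env(expected: str = "genau",
                 env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if the active conda environment is named `expected`."""
    env = os.environ if env is None else env
    # conda exports CONDA_DEFAULT_ENV with the name of the activated
    # environment (or its path, for environments activated by prefix).
    return env.get("CONDA_DEFAULT_ENV") == expected


if __name__ == "__main__":
    if in_conda_env():
        print("conda environment 'genau' is active")
    else:
        print("conda environment 'genau' is not active; run: conda activate genau")
```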

Dataset Preparation

See Dataset Preparation for details on downloading and preparing the AutoReCap dataset, as well as more information on organizing your custom dataset.

Audio Captioning (AutoCap)

See AutoCap README for details on inference, training, and evaluation of our audio captioner AutoCap.

Audio Generation (GenAU)

See GenAU README for details on inference, training, fine-tuning, and evaluation of our audio generator GenAU.

Citation

If you find this paper useful in your research, please consider citing our work:

@article{haji2026taming,
  title={Taming data and transformers for audio generation},
  author={Haji-Ali, Moayed and Menapace, Willi and Siarohin, Aliaksandr and Balakrishnan, Guha and Ordonez, Vicente},
  journal={International Journal of Computer Vision},
  volume={134},
  number={3},
  pages={87},
  year={2026},
  publisher={Springer}
}

