ICLR'26 | Demo | Dataset
AudioTrust is a large-scale benchmark designed to evaluate the multifaceted trustworthiness of Audio Large Language Models (ALLMs).

News:
- [2026-01-26] AudioTrust was accepted to ICLR'26!
- [2025-09-30] Added support for Kimi-Audio, Step-Fun, Step-Audio2, OpenS2S, and Qwen2.5-Omni.
- [2025-05-16] We release the AudioTrust benchmark!

Contents:
- Overview
- Repository Structure
- Dataset Description
- Scripts Overview
- Quick Start
- Benchmark Tasks
- Citation
- Acknowledgements
- Contact

AudioTrust examines model behavior across six critical dimensions:
- Hallucination: Fabricating content unsupported by audio
- Robustness: Performance under audio degradation
- Authentication: Resistance to speaker spoofing/cloning
- Privacy: Avoiding leakage of personal/private content
- Fairness: Consistency across demographic factors
- Safety: Generating safe, non-toxic, legal content
The benchmark provides:
- Expert-annotated prompts across six sub-datasets
- Model-vs-model evaluation with judge LLMs (e.g., GPT-4o)
- Baseline results and reproducible evaluation scripts
AudioTrust/
├── assets/                  # Logo and visual assets
├── audio_evals/             # Core evaluation engine
│   ├── agg/                 # Metric aggregation logic
│   ├── dataset/             # Dataset preprocessing
│   ├── evaluator/           # Scoring logic
│   ├── process/, models/, prompt/, lib/   # Support code
│   ├── eval_task.py         # Evaluation controller
│   ├── isolate.py           # Single model inference
│   ├── recorder.py          # Output logging
│   ├── registry.py          # Registry entrypoint
│   └── utils.py             # Shared utilities
│
├── registry/                # Modular registry structure
│   └── agg/, dataset/, eval_task/, evaluator/, model/, prompt/, process/, recorder/
│
├── scripts/                 # Shell scripts per task
│   └── hallucination/
│       ├── inference/
│       └── evaluation/
├── data/                    # Organized audio files by task
│   └── hallucination/, robustness/, privacy/, fairness/, authentication/, safety/
├── res/                     # Outputs and logs
├── tests/, utils/           # Tests and preprocessing
├── main.py                  # Main execution entry
├── requirments.txt
├── requirments-offline-model.txt
└── README.md

- Language: English
- Audio Format: WAV, mono, 16kHz
- Size: ~10.4GB across 6 sub-datasets
Each sample includes:
- `Audio`: decoded waveform (if using Hugging Face loader)
- `AudioPath`: path to original WAV file
- `InferencePrompt`: prompt used for model response generation
- `EvaluationPrompt`: prompt for evaluator model
- `Ref`: reference (expected) answer for scoring
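The sample fields can be modeled as a small typed record, which is handy for validating samples before running inference. This is a minimal sketch, not the repository's own loader: the `Sample` dataclass and `validate_sample` helper are hypothetical names, and treating the path, prompt, and reference fields as strings is an assumption. Only the field names come from the dataset description.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Field names come from the dataset description; their types are assumptions.
REQUIRED_FIELDS = ("AudioPath", "InferencePrompt", "EvaluationPrompt", "Ref")


@dataclass(frozen=True)
class Sample:
    """Hypothetical typed view of one AudioTrust record."""

    audio_path: str
    inference_prompt: str
    evaluation_prompt: str
    ref: str
    audio: Optional[Any] = None  # decoded waveform, if the loader provides one


def validate_sample(record: Mapping[str, Any]) -> Sample:
    """Check that a raw record has the documented fields and wrap it."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return Sample(
        audio_path=str(record["AudioPath"]),
        inference_prompt=str(record["InferencePrompt"]),
        evaluation_prompt=str(record["EvaluationPrompt"]),
        ref=str(record["Ref"]),
        audio=record.get("Audio"),
    )
```

Failing early on a missing field keeps a malformed record from surfacing later as an opaque error inside a model call.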
Sub-datasets:
{hallucination, robustness, authentication, privacy, fairness, safety}
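To work with all six sub-datasets at once, a small helper can load each one by name. This is a sketch under assumptions: `load_all_splits` is a hypothetical helper, and it assumes every sub-dataset is exposed as a split with the same name, which the source only shows for `hallucination`. The `loader` parameter exists so the helper can be exercised without network access.

```python
from typing import Any, Callable, Dict, Optional

# Sub-dataset names as listed above; using them as split names is an assumption
# for every entry except "hallucination".
SUB_DATASETS = (
    "hallucination",
    "robustness",
    "authentication",
    "privacy",
    "fairness",
    "safety",
)


def load_all_splits(
    repo_id: str = "JusperLee/AudioTrust",
    loader: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Return a mapping from sub-dataset name to its loaded split."""
    if loader is None:
        # Imported lazily so the helper can be tested without `datasets` installed.
        from datasets import load_dataset

        loader = load_dataset
    return {name: loader(repo_id, split=name) for name in SUB_DATASETS}
```

Calling `load_all_splits()` with no arguments downloads every split, so with roughly 10.4 GB of audio it may be preferable to load one sub-dataset at a time.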
Each subtask contains:
| Folder | Purpose |
|---|---|
| `inference/` | Use a target model (e.g., Gemini) to generate responses |
| `evaluation/` | Use an evaluator model (e.g., GPT-4o) to assess generated outputs |
This supports model-vs-model evaluation pipelines.
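The two-stage flow can be sketched in a few lines. This illustrates the inference-then-evaluation pattern only and is not the code in `audio_evals/`: `run_inference`, `run_evaluation`, and the `target_model` and `judge_model` callables are hypothetical stand-ins for real API clients. Two further assumptions: the judge is text-only, and its prompt is built by appending the response and the reference to the evaluation prompt.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: a target model maps (audio_path, prompt) -> text,
# and a text-only judge maps prompt -> verdict.
TargetModel = Callable[[str, str], str]
JudgeModel = Callable[[str], str]


def run_inference(
    samples: List[Dict[str, str]], target_model: TargetModel
) -> List[Dict[str, str]]:
    """Stage 1: the target model answers each InferencePrompt."""
    results = []
    for sample in samples:
        response = target_model(sample["AudioPath"], sample["InferencePrompt"])
        # Copy the sample so the input list is left unmodified.
        results.append({**sample, "Response": response})
    return results


def run_evaluation(
    results: List[Dict[str, str]], judge_model: JudgeModel
) -> List[Dict[str, str]]:
    """Stage 2: the judge model assesses each response against the reference."""
    judged = []
    for item in results:
        # Assumed layout: evaluation prompt, then response, then reference.
        judge_prompt = (
            f"{item['EvaluationPrompt']}\n"
            f"Response: {item['Response']}\n"
            f"Reference: {item['Ref']}"
        )
        judged.append({**item, "Verdict": judge_model(judge_prompt)})
    return judged
```

Keeping the two stages separate means stored responses can be re-scored with a different judge without re-running the target model.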
scripts/hallucination/
├── inference/
│   └── gemini-2.5-pro.sh
└── evaluation/
    └── gpt-4o.sh

To get started, clone the repository and install the dependencies:

git clone https://github.com/JusperLee/AudioTrust.git
cd AudioTrust
pip install -r requirments.txt

Or for offline model use:

pip install -r requirments-offline-model.txt

Load a sub-dataset from Hugging Face:

from datasets import load_dataset
dataset = load_dataset("JusperLee/AudioTrust", split="hallucination")

If you plan to run the evaluation scripts that expect a local data/ folder, first materialize the Hugging Face dataset into the required directory structure:

python utils/materialize_hf_audio.py --dataset-path JusperLee/AudioTrust

# Make sure your API keys are set before running:
export OPENAI_API_KEY=your-openai-api-key
export GOOGLE_API_KEY=your-google-api-key
# Step 1: Run inference with Gemini
bash scripts/hallucination/inference/gemini-2.5-pro.sh
# Step 2: Run evaluation using GPT-4o
bash scripts/hallucination/evaluation/gpt-4o.sh

Or directly with Python:
export OPENAI_API_KEY=your-openai-api-key
python main.py \
--dataset hallucination-content_mismatch \
--prompt hallucination-inference-content-mismatch-exp1-v1 \
--model gemini-1.5-pro

| Task | Metric | Description |
|---|---|---|
| Hallucination Detection | Accuracy / Recall | Groundedness of response in audio |
| Robustness Evaluation | Accuracy / Δ Score | Performance drop under corruption |
| Authentication Testing | Attack Success Rate | Resistance to spoofing / voice cloning |
| Privacy Leakage | Leakage Rate | Does the model leak private content? |
| Fairness Auditing | Bias Index | Demographic response disparity |
| Safety Assessment | Violation Score | Generation of unsafe or harmful content |
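Several of the metrics in the table reduce to simple ratios over judged outputs. The sketch below shows one plausible way to compute them. The exact definitions used by AudioTrust are not given here, so these formulas are assumptions: accuracy is the fraction of correct verdicts, the Δ score is clean accuracy minus corrupted accuracy, and attack success rate and leakage rate are the fraction of flagged cases. All function names are illustrative. Bias Index and Violation Score are left out because their definitions cannot be inferred from the table.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True values; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of responses judged correct."""
    return rate(correct)


def delta_score(
    clean_correct: Sequence[bool], corrupted_correct: Sequence[bool]
) -> float:
    """Assumed definition: accuracy drop from clean to corrupted audio."""
    return accuracy(clean_correct) - accuracy(corrupted_correct)


def attack_success_rate(attack_succeeded: Sequence[bool]) -> float:
    """Fraction of spoofing or cloning attempts the model accepted."""
    return rate(attack_succeeded)


def leakage_rate(leaked: Sequence[bool]) -> float:
    """Fraction of responses flagged as revealing private content."""
    return rate(leaked)
```

Under these assumed definitions, a positive Δ score means the model performed worse on corrupted audio than on clean audio, and lower attack success and leakage rates are better.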
@inproceedings{li2025audiotrust,
title={AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models},
author={Li, Kai and Shen, Can and Liu, Yile and Han, Jirui and Zheng, Kelong and Zou, Xuechao and Wang, Zhe and Du, Xingjian and Zhang, Shun and Luo, Hanjun and others},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}

We gratefully acknowledge UltraEval-Audio for providing the core infrastructure that inspired and supported parts of this benchmark.
For questions or collaboration inquiries:
- Kai Li: tsinghua.kaili@gmail.com, Xinfeng Li: lxfmakeit@gmail.com
- Project Page: Coming Soon


