
JusperLee/AudioTrust



🎧 AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

📜 ICLR'26 | 🎶 Demo | 🤗 Dataset


🔍 Overview

AudioTrust is a large-scale benchmark for evaluating the multifaceted trustworthiness of Audio Large Language Models (ALLMs). It examines model behavior across six critical dimensions:

  • 🎯 Hallucination: Fabricating content unsupported by audio
  • 🛡️ Robustness: Performance under audio degradation
  • 🧑‍💻 Authentication: Resistance to speaker spoofing/cloning
  • 🕵️ Privacy: Avoiding leakage of personal/private content
  • ⚖️ Fairness: Consistency across demographic factors
  • 🚨 Safety: Generating safe, non-toxic, legal content


The benchmark provides:

  • ✅ Expert-annotated prompts across six sub-datasets
  • 🔬 Model-vs-model evaluation with judge LLMs (e.g., GPT-4o)
  • 📈 Baseline results and reproducible evaluation scripts

📁 Repository Structure

AudioTrust/
├── assets/                       # Logo and visual assets
├── audio_evals/                  # Core evaluation engine
│   ├── agg/                      # Metric aggregation logic
│   ├── dataset/                  # Dataset preprocessing
│   ├── evaluator/                # Scoring logic
│   ├── process/, models/, prompt/, lib/  # Support code
│   ├── eval_task.py              # Evaluation controller
│   ├── isolate.py                # Single model inference
│   ├── recorder.py               # Output logging
│   ├── registry.py               # Registry entrypoint
│   └── utils.py                  # Shared utilities
│
├── registry/                     # Modular registry structure
│   ├── agg/, dataset/, eval_task/, evaluator/, model/, prompt/, process/, recorder/
│
├── scripts/                      # Shell scripts per task
│   └── hallucination/
│       ├── inference/
│       └── evaluation/
├── data/                         # Organized audio files by task
│   ├── hallucination/, robustness/, privacy/, fairness/, authentication/, safety/
├── res/                          # Outputs and logs
├── tests/, utils/                # Tests and preprocessing
├── main.py                       # Main execution entry
├── requirments.txt
├── requirments-offline-model.txt
└── README.md

📦 Dataset Description

  • Language: English
  • Audio Format: WAV, mono, 16 kHz
  • Size: ~10.4 GB across 6 sub-datasets
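
As a quick sanity check on the audio format listed above, the following sketch uses Python's standard `wave` module to confirm that a file is mono and sampled at 16 kHz. The helper names `is_mono_16k` and `write_silence` are illustrative assumptions and are not part of the repository.

```python
import os
import tempfile
import wave


def is_mono_16k(path: str) -> bool:
    """Return True if the WAV file at `path` is mono and sampled at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000


def write_silence(path: str, channels: int, rate: int, seconds: float = 0.1) -> None:
    """Write a short silent 16-bit PCM WAV file (demo helper only)."""
    n_frames = int(rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * n_frames * channels)


if __name__ == "__main__":
    # Create a throwaway file in the expected format and verify it.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.wav")
        write_silence(demo, channels=1, rate=16000)
        print(is_mono_16k(demo))
```

Running a check like this over a local data/ folder before inference can surface files that were resampled or saved in stereo by mistake.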

Each sample includes:

  • Audio: decoded waveform (if using Hugging Face loader)
  • AudioPath: path to original WAV file
  • InferencePrompt: prompt used for model response generation
  • EvaluationPrompt: prompt for evaluator model
  • Ref: reference (expected) answer for scoring
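
The field list above can be turned into a small completeness check. This is a minimal sketch: the function `missing_fields` and the example record (path, prompts, and reference text) are hypothetical, and only the five field names come from the description above.

```python
from typing import Any, Dict, List

# Field names as listed in the dataset description above.
EXPECTED_FIELDS = ("Audio", "AudioPath", "InferencePrompt", "EvaluationPrompt", "Ref")


def missing_fields(sample: Dict[str, Any]) -> List[str]:
    """Return the expected field names that are absent from `sample`."""
    return [name for name in EXPECTED_FIELDS if name not in sample]


# Hypothetical record for illustration; real values come from the dataset.
example = {
    "AudioPath": "data/hallucination/example.wav",
    "InferencePrompt": "Describe what you hear.",
    "EvaluationPrompt": "Does the response match the audio?",
    "Ref": "A dog barking.",
}
print(missing_fields(example))  # → ['Audio']
```

Here `Audio` is reported as missing because the example mimics a record that was not produced by the Hugging Face loader, which is the only case where the decoded waveform is stated to be present.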

Sub-datasets:

  • {hallucination, robustness, authentication, privacy, fairness, safety}

🧪 Scripts Overview

Each subtask contains:

| Folder | Purpose |
| --- | --- |
| inference/ | Use a target model (e.g., Gemini) to generate responses |
| evaluation/ | Use an evaluator model (e.g., GPT-4o) to assess generated outputs |

This supports model-vs-model evaluation pipelines.
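
The two-stage flow can be pictured with the sketch below. It is not the repository's implementation: `run_pipeline`, `fake_target`, and `fake_judge` are hypothetical names, the stand-in models are plain functions so the example runs offline, and passing the reference answer to the judge is an assumption.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]


def run_pipeline(
    samples: List[Sample],
    target_model: Callable[[str, str], str],
    judge_model: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Stage 1: the target model answers; stage 2: the judge assesses each answer."""
    results = []
    for sample in samples:
        # Inference stage: audio path plus inference prompt go to the target model.
        response = target_model(sample["AudioPath"], sample["InferencePrompt"])
        # Evaluation stage: the judge sees its prompt, the response, and the reference.
        verdict = judge_model(sample["EvaluationPrompt"], response, sample["Ref"])
        results.append({"response": response, "verdict": verdict})
    return results


# Stand-in callables so the sketch runs offline; real runs would call model APIs.
def fake_target(audio_path: str, prompt: str) -> str:
    return "a dog barking"


def fake_judge(eval_prompt: str, response: str, ref: str) -> str:
    return "correct" if response.lower() == ref.lower() else "incorrect"


demo = [{
    "AudioPath": "data/hallucination/example.wav",
    "InferencePrompt": "Describe what you hear.",
    "EvaluationPrompt": "Does the response match the reference?",
    "Ref": "A dog barking",
}]
print(run_pipeline(demo, fake_target, fake_judge))
```

Keeping the two stages as separate callables mirrors the inference/ and evaluation/ split: either model can be swapped without touching the other stage.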

🧩 Example: Hallucination Task

scripts/hallucination/
├── inference/
│   └── gemini-2.5-pro.sh
└── evaluation/
    └── gpt-4o.sh

🚀 Quick Start

1. Install Dependencies

git clone https://github.com/JusperLee/AudioTrust.git
cd AudioTrust
pip install -r requirments.txt

Or for offline model use:

pip install -r requirments-offline-model.txt

2. Load Dataset from Hugging Face

from datasets import load_dataset
dataset = load_dataset("JusperLee/AudioTrust", split="hallucination")

Materialize the HF dataset to the project data/ layout

If you plan to run the evaluation scripts that expect a local data/ folder, first materialize the Hugging Face dataset into the required directory structure:

python utils/materialize_hf_audio.py --dataset-path JusperLee/AudioTrust

3. Run Inference and Evaluation

# Make sure your API keys are set before running:
export OPENAI_API_KEY=your-openai-api-key
export GOOGLE_API_KEY=your-google-api-key

# Step 1: Run inference with Gemini
bash scripts/hallucination/inference/gemini-2.5-pro.sh

# Step 2: Run evaluation using GPT-4o
bash scripts/hallucination/evaluation/gpt-4o.sh

Or directly with Python:

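# Set the key for the provider behind --model (a Gemini model in this example)
export OPENAI_API_KEY=your-openai-api-key
export GOOGLE_API_KEY=your-google-api-key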
python main.py \
  --dataset hallucination-content_mismatch \
  --prompt hallucination-inference-content-mismatch-exp1-v1 \
  --model gemini-1.5-pro
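
To launch several runs from Python rather than from the shell, the same command can be assembled programmatically. This is a minimal sketch: `build_command` is a hypothetical helper, and it relies only on the three flags shown above.

```python
import sys
from typing import List


def build_command(dataset: str, prompt: str, model: str) -> List[str]:
    """Assemble the argument list for main.py using the three flags shown above."""
    return [
        sys.executable, "main.py",
        "--dataset", dataset,
        "--prompt", prompt,
        "--model", model,
    ]


cmd = build_command(
    dataset="hallucination-content_mismatch",
    prompt="hallucination-inference-content-mismatch-exp1-v1",
    model="gemini-1.5-pro",
)
print(" ".join(cmd[1:]))
# To launch from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when dataset or prompt names contain unusual characters.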

📊 Benchmark Tasks

| Task | Metric | Description |
| --- | --- | --- |
| Hallucination Detection | Accuracy / Recall | Groundedness of response in audio |
| Robustness Evaluation | Accuracy / Δ Score | Performance drop under corruption |
| Authentication Testing | Attack Success Rate | Resistance to spoofing / voice cloning |
| Privacy Leakage | Leakage Rate | Does the model leak private content? |
| Fairness Auditing | Bias Index | Demographic response disparity |
| Safety Assessment | Violation Score | Generation of unsafe or harmful content |
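
The exact metric formulas are not given here, so the sketch below only illustrates how rate-style metrics of this kind are commonly computed. The definitions are assumptions (accuracy and attack success rate as simple fractions, Δ score as clean accuracy minus degraded accuracy), the function names are hypothetical, and the verdict lists are made up.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True entries; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def delta_score(clean_correct: Sequence[bool], degraded_correct: Sequence[bool]) -> float:
    """Accuracy on clean audio minus accuracy on degraded audio (assumed definition)."""
    return rate(clean_correct) - rate(degraded_correct)


# Made-up judge verdicts for illustration only.
clean = [True, True, True, False]              # 3 of 4 correct
degraded = [True, False, False, False]         # 1 of 4 correct
attacks = [True, False, False, False, False]   # 1 of 5 spoofing attempts succeeded

print(rate(clean))                   # 0.75
print(delta_score(clean, degraded))  # 0.5
print(rate(attacks))                 # 0.2
```

Under these assumed definitions, a larger Δ score means a larger performance drop under corruption, and a lower attack success rate means stronger resistance to spoofing.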

📌 Citation

@inproceedings{li2025audiotrust,
  title={AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models},
  author={Li, Kai and Shen, Can and Liu, Yile and Han, Jirui and Zheng, Kelong and Zou, Xuechao and Wang, Zhe and Du, Xingjian and Zhang, Shun and Luo, Hanjun and others},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

🙏 Acknowledgements

We gratefully acknowledge UltraEval-Audio for providing the core infrastructure that inspired and supported parts of this benchmark.

📬 Contact

For questions or collaboration inquiries:
