47 changes: 23 additions & 24 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,13 +6,13 @@

## Project

`noisekit` is a `uvx`-compatible Python CLI that generates degraded speech datasets from clean HuggingFace corpora. It simulates seven atomic audio degradation scenarios — telecom (G.711 calls), low-bitrate codec compression, noisy environments (real ambient noise), far-field reverb, transmission dropout, and clipping distortion — plus compound multi-condition scenarios built by chaining atomic presets. Designed for ASR noise-robustness benchmarking. A `clean_reference` control completes the catalog.
`noisekit` is a `uvx`-compatible Python CLI that generates degraded speech datasets from clean HuggingFace corpora. It simulates six atomic audio degradation scenarios — telecom (G.711 calls), low-bitrate codec compression, noisy environments (real ambient noise), far-field reverb, and clipping distortion — plus compound multi-condition scenarios built by chaining atomic presets. Designed for ASR noise-robustness benchmarking. A `clean_reference` control completes the catalog.

## Package Management

Use **UV** for everything: `uv add`, `uv run`, `uv sync`. Never use pip directly.

Key runtime dependencies: `audiomentations>=0.38`, `lameenc>=1.4` (pure-Python MP3 encoder used by `Mp3Compression` in `telecom` and `low_bitrate`; no system ffmpeg needed), `torchmetrics>=1.7.0` (NISQA scoring — downloads ~50 MB model weights to `~/.torchmetrics/NISQA/` on first use), `pyroomacoustics` (room acoustics simulation for `reverb_far_field` — now a core dependency, no extra install needed).
Key runtime dependencies: `audiomentations>=0.38`, `lameenc>=1.4` (pure-Python MP3 encoder used by `Mp3Compression` in `telecom` and `low_bitrate`; no system ffmpeg needed), `torchmetrics>=1.7.0` (NISQA scoring — downloads ~50 MB model weights to `~/.torchmetrics/NISQA/` on first use), `pyroomacoustics` (room acoustics simulation for `reverb` — now a core dependency, no extra install needed).

## Architecture

Expand All @@ -23,7 +23,7 @@ noisekit/
├── dataset.py # HuggingFace dataset loading (soundfile decoder, no torchcodec)
├── transforms.py # Preset loading; returns PresetTransforms(full, scoring, scoring_sr)
├── scoring.py # PESQ + SNR + NISQA; PESQ NB at 8 kHz for telephony presets
├── noise_cache.py # Auto-downloads MUSAN music+noise for noisy_environment
├── noise_cache.py # Auto-downloads MUSAN music+noise for noise
└── presets/ # YAML preset files bundled with the package

```
Expand All @@ -32,7 +32,7 @@ noisekit/

```bash
noisekit generate --dataset <hf-name> --samples N --presets P1 P2 --output ./out --seed 42
noisekit generate ... --presets noisy_environment --noise-dir /path/to/noise_wavs
noisekit generate ... --presets noise --noise-dir /path/to/noise_wavs
noisekit generate ... --no-nisqa # skip NISQA (no model download, faster)
noisekit score ./audio_dir [--reference-dir ./ref] [--output scores.json]
noisekit score ./audio_dir --no-nisqa # skip NISQA for standalone scoring
Expand All @@ -41,7 +41,7 @@ noisekit list-presets [--verbose]

Custom presets: `--preset-file ./my_preset.yaml`

The `noisy_environment` preset uses a directory of background-noise WAVs. If `--noise-dir` is omitted, noisekit auto-downloads a small MUSAN **noise-only** subset (~20 files, ~120 MB) from `Aynursusuz/musan-audio-dataset` on HuggingFace to `~/.cache/noisekit/noise/musan_ambient/` on first use. Both `speech` and `music` classes are excluded: speech pollutes ASR/PESQ scoring; music sounds artificial as a background and is indistinguishable from white noise at low levels. Only label 2 (`noise` — wind, rain, traffic, machinery) is downloaded.
The `noise` preset uses a directory of background-noise WAVs. If `--noise-dir` is omitted, noisekit auto-downloads a small MUSAN **noise-only** subset (~20 files, ~120 MB) from `Aynursusuz/musan-audio-dataset` on HuggingFace to `~/.cache/noisekit/noise/musan_ambient/` on first use. Both `speech` and `music` classes are excluded: speech pollutes ASR/PESQ scoring; music sounds artificial as a background and is indistinguishable from white noise at low levels. Only label 2 (`noise` — wind, rain, traffic, machinery) is downloaded.

Pass `--noise-dir /path/to/wavs` to use your own corpus (e.g. MUSAN, DEMAND, FSD50K) instead. Inside a preset YAML, use the literal string `${NOISE_DIR}` as a parameter value and `transforms.load_preset` substitutes the resolved path at load time. Auto-download is wired in `pipeline.run_generate` via `noise_cache.ensure_default_noise_dir()`, gated by `transforms.preset_requires_noise_dir()`.
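
The `${NOISE_DIR}` substitution described above can be sketched roughly as follows. This is a minimal illustration, not the actual body of `transforms.load_preset` or `transforms.preset_requires_noise_dir`: the helper names `substitute_noise_dir` and `requires_noise_dir`, the shape of the parsed preset dict, and the `sounds_path` parameter key are assumptions made for the example.

```python
from pathlib import Path
from typing import Any

NOISE_DIR_TOKEN = "${NOISE_DIR}"


def substitute_noise_dir(value: Any, noise_dir: Path) -> Any:
    """Recursively replace the literal ${NOISE_DIR} token in a parsed preset.

    Hypothetical helper: walks dicts and lists and rewrites strings only.
    """
    if isinstance(value, str):
        return value.replace(NOISE_DIR_TOKEN, str(noise_dir))
    if isinstance(value, dict):
        return {k: substitute_noise_dir(v, noise_dir) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute_noise_dir(v, noise_dir) for v in value]
    return value


def requires_noise_dir(value: Any) -> bool:
    """Return True if any string in the parsed preset contains the token."""
    if isinstance(value, str):
        return NOISE_DIR_TOKEN in value
    if isinstance(value, dict):
        return any(requires_noise_dir(v) for v in value.values())
    if isinstance(value, list):
        return any(requires_noise_dir(v) for v in value)
    return False


# Assumed preset layout, modelled on the YAML snippets in this PR.
preset = {
    "name": "noise",
    "transforms": [
        {"type": "Normalize", "parameters": {}},
        {"type": "AddBackgroundNoise", "parameters": {"sounds_path": "${NOISE_DIR}"}},
    ],
}

if requires_noise_dir(preset):
    resolved = substitute_noise_dir(preset, Path("/data/noise"))
```

Checking for the token before resolving a directory is what lets presets without noise skip the download entirely.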

Expand Down Expand Up @@ -70,10 +70,9 @@ Built-in presets:
| `clean_reference` | Minimal gain normalization (PESQ ceiling) | full | WB 16 kHz | 4.0-4.5 |
| `telecom` | G.711 call + low-bitrate MP3 codec artifacts | 300-3400 Hz @ 8 kHz | NB 8 kHz | 2.0-3.5 |
| `low_bitrate` | Wideband low-bitrate MP3 compression (16-32 kbps) | 80-7500 Hz @ 16 kHz | WB 16 kHz | 1.5-2.5 |
| `noisy_environment` | Real ambient noise via `AddBackgroundNoise` | up to 8-12 kHz | WB 16 kHz | 2.0-3.5 |
| `clipping_distortion` | Microphone overload / ADC saturation (`ClippingDistortion` 10-25%) | full | WB 16 kHz | 2.0-3.5 |
| `transmission_dropout` | VoIP packet loss: 1-3 silent dropout windows | full | WB 16 kHz | 1.5-3.0 |
| `reverb_far_field` | Far-field reverberant room via `RoomSimulator` | full | WB 16 kHz | 2.0-3.5 |
| `noise` | Real ambient noise via `AddBackgroundNoise` | up to 8-12 kHz | WB 16 kHz | 2.0-3.5 |
| `clipping` | Microphone overload / ADC saturation (`ClippingDistortion` 10-25%) | full | WB 16 kHz | 2.0-3.5 |
| `reverb` | Far-field reverberant room via `RoomSimulator` | full | WB 16 kHz | 2.0-3.5 |

`telecom` and any compound preset ending with `telecom` use the 8 kHz PESQ NB scoring split (see below). All other presets score in PESQ WB at 16 kHz.

Expand All @@ -83,9 +82,9 @@ Compound presets chain two or more atomic presets together. Noise is added first

| Preset | Chain | Requires | PESQ mode | Target MOS |
| ------------------ | ----------------------------------------- | ------------- | --------- | ---------- |
| `noisy_telecom` | `noisy_environment` → `telecom` | `--noise-dir` | NB 8 kHz | 1.5-2.5 |
| `reverb_noisy` | `reverb_far_field` → `noisy_environment` | `--noise-dir` | WB 16 kHz | 1.0-2.5 |
| `clipping_telecom` | `clipping_distortion` → `telecom` | — | NB 8 kHz | 1.0-2.5 |
| `noise_telecom` | `noise` → `telecom` | `--noise-dir` | NB 8 kHz | 1.5-2.5 |
| `noise_reverb` | `noise` → `reverb` | `--noise-dir` | WB 16 kHz | 1.0-2.5 |
| `clipping_telecom` | `clipping` → `telecom` | — | NB 8 kHz | 1.0-2.5 |

### Compound Preset YAML Format

Expand All @@ -103,14 +102,14 @@ Rules:
- `chain` and `transforms` are mutually exclusive.
- Chained entries must be names of built-in atomic presets (no nesting chains).
- `${NOISE_DIR}` resolution and the PESQ NB scoring split are detected automatically across the full concatenated chain.
- `reverb_far_field` uses `pyroomacoustics` (bundled as a core dependency — no extra install needed).
- `reverb` uses `pyroomacoustics` (bundled as a core dependency — no extra install needed).
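
The chain rules above could be enforced along these lines. This is a sketch under stated assumptions: `expand_preset` is a hypothetical name, the atomic transform stacks are shortened placeholders rather than the real YAML contents, and requiring exactly one of `chain` or `transforms` is a slightly stricter reading of "mutually exclusive".

```python
from typing import Any


def expand_preset(
    preset: dict[str, Any], atomic: dict[str, dict[str, Any]]
) -> list[dict[str, Any]]:
    """Flatten a preset into one ordered list of transform specs.

    Hypothetical helper illustrating the rules, not noisekit's loader.
    """
    has_chain = "chain" in preset
    has_transforms = "transforms" in preset
    if has_chain == has_transforms:
        # Both present (mutually exclusive) or neither present (assumed invalid).
        raise ValueError("preset must define exactly one of 'chain' or 'transforms'")
    if has_transforms:
        return list(preset["transforms"])

    expanded: list[dict[str, Any]] = []
    for name in preset["chain"]:
        if name not in atomic:
            raise ValueError(f"unknown atomic preset in chain: {name!r}")
        entry = atomic[name]
        if "chain" in entry:
            raise ValueError(f"chains cannot be nested: {name!r} is itself a chain")
        # Concatenate in chain order so later stages see earlier degradation.
        expanded.extend(entry["transforms"])
    return expanded


# Placeholder atomic presets (abbreviated stacks, for illustration only).
ATOMIC = {
    "clipping": {"transforms": [{"type": "ClippingDistortion"}]},
    "telecom": {"transforms": [{"type": "BitCrush"}, {"type": "Resample"}]},
}

flat = expand_preset({"name": "clipping_telecom", "chain": ["clipping", "telecom"]}, ATOMIC)
```

Working on the flattened list is what makes it possible to detect `${NOISE_DIR}` usage and the narrowband scoring split across the whole chain.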

### Why no white noise

The catalog deliberately avoids `AddGaussianSNR` — white Gaussian noise sounds artificial and doesn't reflect real production audio. Instead:

- `telecom` and `low_bitrate` rely on `Mp3Compression` at 16-32 kbps for realistic codec smearing/pre-echo.
- `noisy_environment` uses `AddBackgroundNoise` over a user-supplied WAV corpus (MUSAN/DEMAND/FSD50K), so the noise floor matches the real environment you care about.
- `noise` uses `AddBackgroundNoise` over a user-supplied WAV corpus (MUSAN/DEMAND/FSD50K), so the noise floor matches the real environment you care about.

## PESQ Scoring — Important Design Decision

Expand All @@ -134,7 +133,7 @@ if peak > 1e-9:

**Safety:** The same normalized `ref_16k` is used as both the transform input and the PESQ/SNR reference, so all quality metrics remain valid relative comparisons. The mid-chain `Normalize` inside `telecom.yaml` (before `BitCrush`) is still needed separately — the bandpass filter removes energy and that step re-normalizes before quantization.
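
As a rough illustration of the safety argument, the sketch below normalizes a reference once and reuses that same array as both transform input and scoring reference. Only the `peak > 1e-9` guard appears in this diff; the helper name `peak_normalize`, the target peak of 1.0, and the plain-list representation are assumptions for the example.

```python
def peak_normalize(samples, target_peak=1.0, silence_floor=1e-9):
    """Scale samples so the largest absolute value equals target_peak.

    Hypothetical helper; near-silent input is returned unscaled to avoid
    dividing by a vanishing peak.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak > silence_floor:
        scale = target_peak / peak
        return [s * scale for s in samples]
    return list(samples)


raw = [0.1, -0.2, 0.05]
ref_16k = peak_normalize(raw)

# The same normalized signal feeds both sides of the comparison, so the
# gain change cancels out of any relative quality metric.
transform_input = ref_16k
scoring_reference = ref_16k
```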

**`noisy_environment` also pre-normalizes:** `noisy_environment.yaml` adds a `Normalize` as its first transform. This handles the `reverb_noisy` compound case: `RoomSimulator` can attenuate the signal by ~10× at large mic distances; without the mid-chain normalize, `AddBackgroundNoise` would see the attenuated level and mix noise too quietly. All compound presets using `noisy_environment` inherit this fix automatically.
**`noise` also pre-normalizes:** `noise.yaml` adds a `Normalize` as its first transform. This handles the `noise_reverb` compound case: `RoomSimulator` can attenuate the signal by ~10× at large mic distances; without the mid-chain normalize, `AddBackgroundNoise` would see the attenuated level and mix noise too quietly. All compound presets using `noise` inherit this fix automatically.

`transforms.py` auto-detects this split: if the last transform is `Resample(16000)`, it creates a `scoring` Compose (all-but-last) alongside the `full` Compose.
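
The detection rule can be sketched on plain transform specs, as below. This is an illustrative stand-in rather than the code in `transforms.py`: `SplitSketch` mimics the `PresetTransforms(full, scoring, scoring_sr)` shape, and the dict layout with `min_sample_rate`/`max_sample_rate` keys is an assumption about how the final `Resample(16000)` is written in the preset.

```python
from typing import Any, NamedTuple, Optional


class SplitSketch(NamedTuple):
    """Stand-in for PresetTransforms(full, scoring, scoring_sr)."""
    full: list
    scoring: Optional[list]
    scoring_sr: int


def split_for_scoring(
    transforms: list[dict[str, Any]],
    output_sr: int = 16000,
    narrowband_sr: int = 8000,
) -> SplitSketch:
    """Peel off a trailing upsample so scoring can run before it."""
    last = transforms[-1] if transforms else None
    params = last.get("parameters", {}) if last else {}
    ends_with_upsample = (
        last is not None
        and last.get("type") == "Resample"
        and params.get("min_sample_rate") == output_sr
        and params.get("max_sample_rate") == output_sr
    )
    if ends_with_upsample:
        # Score on everything except the final Resample, i.e. at 8 kHz.
        return SplitSketch(list(transforms), list(transforms[:-1]), narrowband_sr)
    # No trailing upsample: score the full chain output at 16 kHz.
    return SplitSketch(list(transforms), None, output_sr)


telecom_like = [
    {"type": "BitCrush", "parameters": {}},
    {"type": "Resample", "parameters": {"min_sample_rate": 16000, "max_sample_rate": 16000}},
]
split = split_for_scoring(telecom_like)
```

Because the check looks only at the last transform, a compound preset that ends with `telecom` picks up the narrowband split without any extra configuration.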

Expand Down Expand Up @@ -184,19 +183,19 @@ cat test_out/metadata.jsonl
# New atomic presets — no external dependencies
uv run noisekit generate \
--dataset google/fleurs --config en_us --split test \
--samples 3 --presets clipping_distortion transmission_dropout \
--samples 3 --presets clipping \
--no-nisqa --output ./test_atomic --seed 42

# noisy_environment — auto-downloads MUSAN noise-only clips on first run
# noise — auto-downloads MUSAN noise-only clips on first run
uv run noisekit generate \
--dataset google/fleurs --config en_us --split test \
--samples 3 --presets noisy_environment \
--samples 3 --presets noise \
--output ./test_noise --seed 42

# Compound presets (auto-downloads MUSAN noise on first run)
uv run noisekit generate \
--dataset google/fleurs --config en_us --split test \
--samples 3 --presets noisy_telecom \
--samples 3 --presets noise_telecom \
--no-nisqa --output ./test_compound --seed 42

# clipping_telecom — no noise dir needed
Expand All @@ -208,19 +207,19 @@ uv run noisekit generate \
# Far-field reverb
uv run noisekit generate \
--dataset google/fleurs --config en_us --split test \
--samples 3 --presets reverb_far_field reverb_noisy \
--samples 3 --presets reverb noise_reverb \
--no-nisqa --output ./test_reverb --seed 42

# noisy_environment with your own noise corpus (skips auto-download)
# noise with your own noise corpus (skips auto-download)
uv run noisekit generate \
--dataset google/fleurs --config en_us --split test \
--samples 3 --presets noisy_environment \
--samples 3 --presets noise \
--noise-dir ~/datasets/musan/noise \
--output ./test_noise --seed 42
```

Expected PESQ spread: clean ~4.6, telecom ~2.5-3.5 (NB), low_bitrate ~1.5-2.5 (WB), noisy_environment ~1.0-2.5 (WB), clipping_distortion ~2.0-3.5 (WB), transmission_dropout ~1.5-3.0 (WB), reverb_far_field ~2.0-3.5 (WB).
Expected PESQ spread: clean ~4.6, telecom ~2.5-3.5 (NB), low_bitrate ~1.5-2.5 (WB), noise ~1.0-2.5 (WB), clipping ~2.0-3.5 (WB), reverb ~2.0-3.5 (WB).

Compound preset PESQ: noisy_telecom ~1.5-2.5 (NB), clipping_telecom ~1.0-2.5 (NB), reverb_noisy ~1.0-2.5 (WB).
Compound preset PESQ: noise_telecom ~1.5-2.5 (NB), clipping_telecom ~1.0-2.5 (NB), noise_reverb ~1.0-2.5 (WB).

Expected NISQA spread: clean ~4.0-4.5, degraded presets ~1.5-3.0. NISQA model weights (~50 MB) are downloaded on first run.
31 changes: 15 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ Generate degraded speech datasets for noise-robust ASR benchmarking.

Takes a clean HuggingFace speech dataset, applies real-world degradation presets via [audiomentations](https://github.com/iver56/audiomentations), and scores each output with PESQ, SNR, and NISQA, producing a JSONL manifest ready for noise-robustness benchmarking.

Seven atomic degradation scenarios are built in: telephony (G.711 + low-bitrate codec), wideband codec compression, ambient noise, clipping distortion, transmission dropout, and far-field reverb. Atomic presets compose into compound multi-condition scenarios.
Six atomic degradation scenarios are built in: telephony (G.711 + low-bitrate codec), wideband codec compression, ambient noise, clipping distortion, and far-field reverb. Atomic presets compose into compound multi-condition scenarios.

> [!NOTE]
> Degradations are programmatically simulated. Scores may not generalize to genuine production recordings; validate final benchmarks on annotated real-world data.
Expand Down Expand Up @@ -61,12 +61,12 @@ uvx noisekit generate \
--seed 42
```

For `noisy_environment`, supply a directory of real noise WAVs (e.g. [MUSAN](https://www.openslr.org/17/), [DEMAND](https://zenodo.org/record/1227121), or [FSD50K](https://zenodo.org/record/4060432)):
For `noise`, you can supply your own background-noise WAVs with `--noise-dir` (e.g. [MUSAN](https://www.openslr.org/17/), [DEMAND](https://zenodo.org/record/1227121), or [FSD50K](https://zenodo.org/record/4060432)):

```bash
uvx noisekit generate \
--dataset google/fleurs --config en_us --split test \
--samples 300 --presets noisy_environment \
--samples 300 --presets noise \
--noise-dir ~/datasets/musan/noise \
--output ./benchmark_dataset --seed 42
```
Expand Down Expand Up @@ -131,7 +131,7 @@ uvx noisekit list-presets --verbose # show full transform stack

## Presets

Ten built-in presets: seven atomic scenarios, three compound multi-condition presets, and a clean reference control. None use synthetic white noise; codec artifacts, real ambient recordings, and room simulation produce the degradation instead.
Nine built-in presets: six atomic scenarios, three compound multi-condition presets, and a clean reference control. None use synthetic white noise; codec artifacts, real ambient recordings, and room simulation produce the degradation instead.

### Atomic presets

Expand All @@ -140,34 +140,33 @@ Ten built-in presets: seven atomic scenarios, three compound multi-condition pre
| `clean_reference` | Minimal processing (PESQ ceiling / control) | 4.0-4.5 |
| `telecom` | G.711-style call: 8 kHz bandpass + 8-bit BitCrush + 16-32 kbps MP3 codec | NB 2.0-3.5 |
| `low_bitrate` | Wideband audio crushed by 16-32 kbps MP3 compression | WB 1.5-2.5 |
| `noisy_environment` | Real ambient noise from `--noise-dir` mixed in at SNR 5-15 dB | WB 1.0-2.5 |
| `clipping_distortion` | Microphone overload: clips the loudest 10-25% of samples | WB 2.0-3.5 |
| `transmission_dropout` | VoIP packet loss: 1-3 silent dropout windows (60-180 ms each) | WB 1.5-3.0 |
| `reverb_far_field` | Far-field room reverb at 1-3 m mic distance | WB 2.0-3.5 |
| `noise` | Real ambient noise from `--noise-dir` mixed in at SNR 5-15 dB | WB 1.0-2.5 |
| `clipping` | Microphone overload: clips the loudest 10-25% of samples | WB 2.0-3.5 |
| `reverb` | Far-field room reverb at 1-3 m mic distance | WB 2.0-3.5 |

`telecom` is scored with PESQ narrowband at 8 kHz (before the final upsample); all other presets are scored wideband at 16 kHz.

All atomic presets require no noise corpus. All dependencies, including `pyroomacoustics` (used by `reverb_far_field`), are bundled with no extra install needed.
All dependencies, including `pyroomacoustics` (used by `reverb`), are bundled with no extra install needed.

`noisy_environment` requires `--noise-dir` pointing at a directory of background-noise WAVs (e.g. MUSAN, DEMAND, FSD50K). If omitted, noisekit auto-downloads a small MUSAN noise-only subset (~120 MB) from HuggingFace on first use.
`noise` accepts a `--noise-dir` pointing at a directory of background-noise WAVs (e.g. MUSAN, DEMAND, FSD50K). If omitted, noisekit auto-downloads a small MUSAN noise-only subset (~20 files, ~120 MB) to `~/.cache/noisekit/noise/musan_ambient/` on first use.

### Compound presets

Compound presets chain two atomic presets together. Noise is applied first (acoustic environment), then codec or dropout (digital processing on the already-degraded signal).

| Preset | Chain | Requires | PESQ |
| ------------------ | ---------------------------------------- | ------------- | ---------- |
| `noisy_telecom` | `noisy_environment` → `telecom` | `--noise-dir` | NB 1.5-2.5 |
| `clipping_telecom` | `clipping_distortion` → `telecom` | (none) | NB 1.0-2.5 |
| `reverb_noisy` | `reverb_far_field` → `noisy_environment` | `--noise-dir` | WB 1.0-2.5 |
| Preset | Chain | Noise source | PESQ |
| ------------------ | ---------------------------------------- | ------------------------------ | ---------- |
| `noise_telecom` | `noise` → `telecom` | `--noise-dir` or auto-download | NB 1.5-2.5 |
| `clipping_telecom` | `clipping` → `telecom` | (none) | NB 1.0-2.5 |
| `noise_reverb` | `noise` → `reverb` | `--noise-dir` or auto-download | WB 1.0-2.5 |

You can also define your own compound preset with a `chain:` key in a YAML file:

```yaml
name: my_compound
description: "Noisy environment then telephony codec"
chain:
- noisy_environment
- noise
- telecom
```

Expand Down
5 changes: 3 additions & 2 deletions noisekit/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,8 +34,9 @@ def generate(
"--noise-dir",
help=(
"Directory of background-noise WAVs (e.g. MUSAN, DEMAND, FSD50K). "
"Used by noisy_environment. If omitted, a small MUSAN music+noise "
"subset is auto-downloaded to ~/.cache/noisekit/ on first use."
"Used by the noise preset and compound noise presets. "
"If omitted, a small MUSAN noise-only subset (~20 files, ~120 MB) "
"is auto-downloaded to ~/.cache/noisekit/noise/musan_ambient/ on first use."
),
),
] = None,
Expand Down
4 changes: 2 additions & 2 deletions noisekit/noise_cache.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
"""Auto-download and cache MUSAN noise-only clips for noisy_environment.
"""Auto-download and cache MUSAN noise-only clips for noise.

Only the `noise` class (wind, rain, traffic, machinery…) is downloaded.
Speech and music are both excluded: speech pollutes ASR/PESQ scoring;
Expand Down Expand Up @@ -28,7 +28,7 @@ def get_default_noise_cache_dir() -> Path:


def ensure_default_noise_dir(num_samples: int = DEFAULT_NOISE_NUM_SAMPLES) -> Path:
"""Return a directory of MUSAN music+noise WAVs, downloading on first use."""
"""Return a directory of MUSAN noise-only WAVs, downloading on first use."""
cache_dir = get_default_noise_cache_dir()
cache_dir.mkdir(parents=True, exist_ok=True)

Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
name: clipping_distortion
name: clipping
description: "Amplitude clipping simulating microphone overload or ADC saturation (10–25% of peak samples)."
transforms:
- type: ClippingDistortion
Expand Down
2 changes: 1 addition & 1 deletion noisekit/presets/clipping_telecom.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
name: clipping_telecom
description: "Amplitude clipping over a telephone channel: ADC saturation then G.711-style narrowband codec."
chain:
- clipping_distortion
- clipping
- telecom
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
name: noisy_environment
description: "Real-world ambient background noise mixed at variable SNR (515 dB). Requires --noise-dir."
name: noise
description: "Real-world ambient background noise mixed at variable SNR (5-15 dB). Uses --noise-dir or auto-downloads MUSAN noise on first use."
transforms:
- type: Normalize
parameters: {}
Expand Down
5 changes: 5 additions & 0 deletions noisekit/presets/noise_reverb.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
name: noise_reverb
description: "Far-field reverberant room with superimposed ambient background noise. Uses --noise-dir or auto-downloads MUSAN noise on first use."
chain:
- noise
- reverb