A Python script that transcribes audio files using OpenAI's Whisper and identifies different speakers using machine learning clustering techniques.
- 🎙️ Accurate transcription using OpenAI Whisper (local processing, no API required)
- 👥 Speaker diarization - automatically identifies different speakers
- ⏰ Timestamp support - precise timing for each speech segment
- 🗣️ Multi-language support - currently optimized for Russian
- 🔧 Configurable - adjustable chunk size, model quality, and number of speakers
- 💾 No external services - runs completely offline after the initial model download
- Python 3.8+
- FFmpeg (for audio processing)
- macOS, Linux, or Windows
- Clone the repository:

  ```bash
  git clone <your-repo-url>
  cd whisper-speaker-diarization
  ```
- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Install FFmpeg:

  macOS (using Homebrew):

  ```bash
  brew install ffmpeg
  ```

  Ubuntu/Debian:

  ```bash
  sudo apt update
  sudo apt install ffmpeg
  ```

  Windows: Download from https://ffmpeg.org/download.html
```bash
python transcribe_audio.py path/to/your/audio.wav
```

Or run without arguments, and the script will prompt you to enter the audio file path:

```bash
python transcribe_audio.py
```

Edit the script to adjust settings:
```python
CHUNK_LENGTH_MIN = 10              # Audio chunk size in minutes
MODEL_NAME = "base"                # Whisper model: tiny, base, small, medium, large
NUM_SPEAKERS = 2                   # Expected number of speakers
OUTPUT_FILE = "transcription.txt"  # Output filename
```

| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | 39MB | Fastest | Basic |
| base | 142MB | Fast | Good |
| small | 466MB | Medium | Better |
| medium | 1.5GB | Slow | Very Good |
| large | 2.9GB | Slowest | Best |
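The `CHUNK_LENGTH_MIN` setting determines how long audio is split before transcription. A minimal sketch of the chunk-boundary computation (the helper name `chunk_boundaries` is illustrative, not taken from the script):

```python
CHUNK_LENGTH_MIN = 10  # Audio chunk size in minutes, as in the configuration above

def chunk_boundaries(total_seconds: float, chunk_min: int = CHUNK_LENGTH_MIN):
    """Yield (start, end) offsets in seconds for consecutive audio chunks."""
    chunk_sec = chunk_min * 60
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_sec, total_seconds)
        yield (start, end)
        start = end

# A 25-minute file with 10-minute chunks yields three chunks:
# two full 600-second chunks and one final 300-second chunk.
```

The final chunk is simply whatever remains, so short trailing audio is never dropped.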
The script generates a timestamped transcription with speaker identification:
```
[00:00 - 00:05] Спикер 1: У нас будет группы 5 человек и по 4.
[00:05 - 00:11] Спикер 2: Мы группы образуем спортивным образом.
[00:11 - 00:14] Спикер 1: Почитаемся на первую, вторую, первую и первую.
```
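Each output line follows a fixed `[MM:SS - MM:SS] Спикер N: text` pattern. A small formatting helper along these lines (illustrative, not necessarily the script's exact code) produces it:

```python
def format_line(start: float, end: float, speaker: int, text: str) -> str:
    """Render one transcription line as '[MM:SS - MM:SS] Спикер N: text'."""
    def mmss(seconds: float) -> str:
        m, s = divmod(int(seconds), 60)
        return f"{m:02d}:{s:02d}"
    return f"[{mmss(start)} - {mmss(end)}] Спикер {speaker}: {text}"

# format_line(0, 5, 1, "У нас будет группы 5 человек и по 4.")
# → "[00:00 - 00:05] Спикер 1: У нас будет группы 5 человек и по 4."
```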
- Audio Loading: Uses Whisper's built-in audio loading with FFmpeg
- Chunking: Splits long audio into manageable chunks (default 10 minutes)
- Transcription: Each chunk is transcribed using Whisper
- Feature Extraction: Extracts audio features for each speech segment:
- Signal energy
- Zero crossing rate
- Spectral centroid and bandwidth
- Mel-frequency coefficients
- Speaker Clustering: Uses K-means clustering to group segments by speaker
- Output Generation: Combines transcription with speaker labels and timestamps
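The feature-extraction and clustering steps above can be sketched as follows, assuming NumPy and scikit-learn are installed. The two features shown (signal energy and zero-crossing rate) are a subset of those listed above, and the function names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

NUM_SPEAKERS = 2  # matches the configuration setting

def segment_features(samples: np.ndarray) -> np.ndarray:
    """Compute simple per-segment features: signal energy and zero-crossing rate."""
    energy = float(np.mean(samples ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(samples))) > 0))
    return np.array([energy, zcr])

def assign_speakers(segments: list, n_speakers: int = NUM_SPEAKERS) -> np.ndarray:
    """Cluster segments by their features; returns one speaker label per segment."""
    features = np.stack([segment_features(s) for s in segments])
    return KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit_predict(features)
```

K-means needs the expected number of speakers up front, which is why `NUM_SPEAKERS` is a configuration setting rather than detected automatically.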
`[ERROR] FFmpeg не найден!` ("FFmpeg not found!")
Solution: Install FFmpeg following the installation instructions above.
The script automatically handles SSL certificate issues during model download.
For very long audio files, consider:
- Using a smaller Whisper model (`tiny` or `base`)
- Reducing `CHUNK_LENGTH_MIN`
- Processing shorter segments
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE file for details.
- OpenAI Whisper for speech recognition
- scikit-learn for machine learning clustering
- FFmpeg for audio processing
- GUI interface
- Real-time processing
- More sophisticated speaker diarization
- Support for more languages
- Docker containerization
- Web interface