A Python script that transcribes audio files using OpenAI's Whisper and identifies different speakers using machine learning clustering techniques.
- 🎙️ Accurate transcription using OpenAI Whisper (local processing, no API required)
- 👥 Speaker diarization - automatically identifies different speakers
- ⏰ Timestamp support - precise timing for each speech segment
- 🗣️ Multi-language support - currently optimized for Russian
- 🔧 Configurable - adjustable chunk size, model quality, and number of speakers
- 💾 No external services - runs completely offline after the initial model download
- Python 3.8+
- FFmpeg (for audio processing)
- macOS, Linux, or Windows
- Clone the repository:

  ```bash
  git clone <your-repo-url>
  cd whisper-speaker-diarization
  ```
- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Install FFmpeg:

  macOS (using Homebrew):

  ```bash
  brew install ffmpeg
  ```

  Ubuntu/Debian:

  ```bash
  sudo apt update
  sudo apt install ffmpeg
  ```

  Windows: Download from https://ffmpeg.org/download.html
```bash
python transcribe_audio.py path/to/your/audio.wav
```

Or run without arguments, and the script will prompt you to enter the audio file path:

```bash
python transcribe_audio.py
```

Edit the script to adjust settings:
```python
CHUNK_LENGTH_MIN = 10              # Audio chunk size in minutes
MODEL_NAME = "base"                # Whisper model: tiny, base, small, medium, large
NUM_SPEAKERS = 2                   # Expected number of speakers
OUTPUT_FILE = "transcription.txt"  # Output filename
```

| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | 39MB | Fastest | Basic |
| base | 142MB | Fast | Good |
| small | 466MB | Medium | Better |
| medium | 1.5GB | Slow | Very Good |
| large | 2.9GB | Slowest | Best |
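The `CHUNK_LENGTH_MIN` setting determines how long audio is split before transcription. A minimal sketch of the chunk-boundary computation (the helper name `chunk_boundaries` is illustrative, not taken from the script):

```python
CHUNK_LENGTH_MIN = 10  # Audio chunk size in minutes, as in the configuration above

def chunk_boundaries(total_seconds: float, chunk_min: int = CHUNK_LENGTH_MIN):
    """Yield (start, end) offsets in seconds for consecutive audio chunks."""
    chunk_sec = chunk_min * 60
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_sec, total_seconds)
        yield (start, end)
        start = end

# A 25-minute file with 10-minute chunks yields three chunks:
# two full 600-second chunks and one final 300-second chunk.
```

The final chunk is simply whatever remains, so short trailing audio is never dropped.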
The script generates a timestamped transcription with speaker identification:
```
[00:00 - 00:05] Спикер 1: У нас будет группы 5 человек и по 4.
[00:05 - 00:11] Спикер 2: Мы группы образуем спортивным образом.
[00:11 - 00:14] Спикер 1: Почитаемся на первую, вторую, первую и первую.
```
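Each output line follows a fixed `[MM:SS - MM:SS] Спикер N: text` pattern. A small formatting helper along these lines (illustrative, not necessarily the script's exact code) produces it:

```python
def format_line(start: float, end: float, speaker: int, text: str) -> str:
    """Render one transcription line as '[MM:SS - MM:SS] Спикер N: text'."""
    def mmss(seconds: float) -> str:
        m, s = divmod(int(seconds), 60)
        return f"{m:02d}:{s:02d}"
    return f"[{mmss(start)} - {mmss(end)}] Спикер {speaker}: {text}"

# format_line(0, 5, 1, "У нас будет группы 5 человек и по 4.")
# → "[00:00 - 00:05] Спикер 1: У нас будет группы 5 человек и по 4."
```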
- Audio Loading: Uses Whisper's built-in audio loading with FFmpeg
- Chunking: Splits long audio into manageable chunks (default 10 minutes)
- Transcription: Each chunk is transcribed using Whisper
- Feature Extraction: Extracts audio features for each speech segment:
- Signal energy
- Zero crossing rate
- Spectral centroid and bandwidth
- Mel-frequency coefficients
- Speaker Clustering: Uses K-means clustering to group segments by speaker
- Output Generation: Combines transcription with speaker labels and timestamps
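The feature-extraction and clustering steps above can be sketched as follows, assuming NumPy and scikit-learn are installed. The two features shown (signal energy and zero-crossing rate) are a subset of those listed above, and the function names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

NUM_SPEAKERS = 2  # matches the configuration setting

def segment_features(samples: np.ndarray) -> np.ndarray:
    """Compute simple per-segment features: signal energy and zero-crossing rate."""
    energy = float(np.mean(samples ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(samples))) > 0))
    return np.array([energy, zcr])

def assign_speakers(segments: list, n_speakers: int = NUM_SPEAKERS) -> np.ndarray:
    """Cluster segments by their features; returns one speaker label per segment."""
    features = np.stack([segment_features(s) for s in segments])
    return KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit_predict(features)
```

K-means needs the expected number of speakers up front, which is why `NUM_SPEAKERS` is a configuration setting rather than detected automatically.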
`[ERROR] FFmpeg не найден!` ("FFmpeg not found!")
Solution: Install FFmpeg following the installation instructions above.
The script automatically handles SSL certificate issues during model download.
For very long audio files, consider:
- Using a smaller Whisper model (`tiny` or `base`)
- Reducing `CHUNK_LENGTH_MIN`
- Processing shorter segments
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE file for details.
- OpenAI Whisper for speech recognition
- scikit-learn for machine learning clustering
- FFmpeg for audio processing
- GUI interface
- Real-time processing
- More sophisticated speaker diarization
- Support for more languages
- Docker containerization
- Web interface