A FastAPI-based web service that generates transcripts and VTT subtitle files from audio and video files using OpenAI's Whisper model.
- 🎵 Support for multiple audio/video formats (MP3, MP4, WAV, M4A, FLAC, etc.)
- 📝 Generate text transcripts
- 🎬 Generate VTT subtitle files with timestamps
- 🌍 Multi-language support with auto-detection
- 🔄 Translation to English
- ⚡ Multiple Whisper model sizes (tiny to large)
- 📊 Detailed API responses with timestamps and segments
- 🔍 Health check and model information endpoints
Clone or create the project directory:
cd whisper-transcript
Create a virtual environment (recommended):
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
Install dependencies:
pip install -r requirements.txt
Install FFmpeg (required for audio/video processing):
Windows:
- Download from https://ffmpeg.org/download.html
- Add to PATH environment variable
macOS:
brew install ffmpeg
Linux:
sudo apt update
sudo apt install ffmpeg
Start the server:
python main.py
The API will be available at http://localhost:8000
Once the server is running, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
GET /health
GET /models
POST /transcribe
Parameters:
- file: Audio/video file (multipart/form-data)
- language: Language code (optional, auto-detected if not provided)
- task: "transcribe" or "translate" (default: "transcribe")
- return_timestamps: Boolean (default: true)
- return_vtt: Boolean (default: true)
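As a rough illustration of how these parameters travel in a request, here is a minimal client sketch that uses only the Python standard library. The helper names `encode_multipart` and `transcribe` are hypothetical and not part of this project. The sketch assumes the server runs at http://localhost:8000 and accepts booleans as the strings "true" and "false", as the cURL example does.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n".encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, base_url="http://localhost:8000", language=None,
               task="transcribe", return_timestamps=True, return_vtt=True):
    """POST a file to /transcribe and return the decoded JSON response."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    fields = {
        "task": task,
        "return_timestamps": str(return_timestamps).lower(),
        "return_vtt": str(return_vtt).lower(),
    }
    if language is not None:  # leave it out so the server auto-detects
        fields["language"] = language
    path = Path(path)
    body, content_type = encode_multipart(fields, "file", path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/transcribe",
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `transcribe("your_audio_file.mp3", language="en")` would return the parsed JSON as a dictionary.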
Example Response:
{
  "filename": "audio.mp3",
  "language": "en",
  "text": "Hello, this is a test transcription...",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is a test"
    }
  ],
  "vtt": "WEBVTT\n\n1\n00:00:00.000 --> 00:00:02.500\nHello, this is a test\n\n",
  "timestamp": "2024-01-01T12:00:00"
}
POST /transcribe-with-files
Creates separate files for transcript, VTT, and full JSON results.
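The `vtt` string in the example response for /transcribe lines up with its `segments` array. The sketch below shows one way such a string could be built from segments. It is an illustration only, not necessarily how main.py does it, and the function names are hypothetical.

```python
def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments):
    """Build a WebVTT document with one numbered cue per segment."""
    lines = ["WEBVTT", ""]
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(str(index))            # cue identifier
        lines.append(f"{start} --> {end}")  # cue timing line
        lines.append(segment["text"].strip())
        lines.append("")                    # blank line ends the cue
    return "\n".join(lines) + "\n"
```

Feeding it the single segment from the example response reproduces the `vtt` value shown there.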
Use the provided test script:
python test_api.py
Or test with cURL:
curl -X POST "http://localhost:8000/transcribe" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@your_audio_file.mp3" \
-F "language=en" \
-F "return_vtt=true"
Supported formats:
- Audio: MP3, WAV, M4A, FLAC, AAC, OGG
- Video: MP4, AVI, MOV, MKV, WebM
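A client could check a file's extension against the formats listed above before uploading. The following is a minimal sketch. The name `media_kind` is hypothetical, and the server may well validate files differently, for example by inspecting their content.

```python
from pathlib import Path

# Extensions taken from the supported-format lists in this README.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".aac", ".ogg"}
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv", ".webm"}


def media_kind(filename):
    """Return "audio", "video", or None if the extension is not listed."""
    suffix = Path(filename).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None
```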
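Available Whisper model sizes (speeds are relative to the large model):
- tiny: Fastest, least accurate (~32x the speed of large)
- base: Good balance of speed and accuracy (~16x the speed of large), the default
- small: Better accuracy, slower (~6x the speed of large)
- medium: High accuracy (~2x the speed of large)
- large: Best accuracy, slowest (1x, the reference speed)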
You can change the model size by modifying the MODEL_SIZE variable in main.py.
You can set these environment variables:
- WHISPER_MODEL_SIZE: Model size (default: "base")
- MAX_FILE_SIZE_MB: Maximum file size in MB (default: 25)
- API_HOST: Host to bind to (default: "0.0.0.0")
- API_PORT: Port to bind to (default: 8000)
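For illustration, these variables and their defaults could be read in Python along the following lines. This is a sketch under assumptions: `Settings`, `load_settings`, and the model-size validation are hypothetical and may not match what main.py actually does.

```python
import os
from dataclasses import dataclass

# Model sizes listed in this README.
VALID_MODEL_SIZES = ("tiny", "base", "small", "medium", "large")


@dataclass(frozen=True)
class Settings:
    model_size: str
    max_file_size_mb: int
    host: str
    port: int


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    model_size = env.get("WHISPER_MODEL_SIZE", "base")
    if model_size not in VALID_MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_size!r}")
    return Settings(
        model_size=model_size,
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "25")),
        host=env.get("API_HOST", "0.0.0.0"),
        port=int(env.get("API_PORT", "8000")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.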
# Windows PowerShell
$env:WHISPER_MODEL_SIZE="small"
$env:MAX_FILE_SIZE_MB="50"
python main.py
# Linux/macOS
export WHISPER_MODEL_SIZE="small"
export MAX_FILE_SIZE_MB="50"
python main.py
Create a Dockerfile:
FROM python:3.9-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
ffmpeg \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "main.py"]
Build and run:
docker build -t whisper-api .
docker run -p 8000:8000 whisper-api
Performance notes:
- First request may be slower as the Whisper model loads
- Model loading time depends on the selected model size
- GPU acceleration is automatically used if available (CUDA/Metal)
- Consider using smaller models for real-time applications
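Because the service can take a while to become responsive after startup, a client may want to poll GET /health before sending work. The sketch below assumes the endpoint answers with HTTP 200 once the service is up. Its response body is not documented here, so only the status code is checked. The names `health_ok` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def health_ok(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(check=health_ok, attempts=30, delay=1.0, sleep=time.sleep):
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The `check` and `sleep` parameters are injectable so the retry loop can be tested without a running server.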
Troubleshooting:
- FFmpeg not found: Ensure FFmpeg is installed and in your PATH
- CUDA out of memory: Use a smaller model size or reduce batch size
- File too large: Increase MAX_FILE_SIZE_MB or compress your file
- Import errors: Ensure all dependencies are installed correctly
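Two of the issues above can be checked quickly from Python. This is a minimal diagnostic sketch with hypothetical helper names. It assumes the size limit counts 1 MB as 1024 * 1024 bytes, which may differ from how the server measures it.

```python
import os
import shutil


def ffmpeg_on_path():
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def exceeds_limit(path, max_file_size_mb=25):
    """Return True if the file at `path` is larger than the limit in MB."""
    # Assumption: 1 MB = 1024 * 1024 bytes.
    return os.path.getsize(path) > max_file_size_mb * 1024 * 1024
```

Run `ffmpeg_on_path()` on the machine that hosts the API, since FFmpeg is needed where the audio is processed.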
MIT License