A command-line speech-to-text tool that transcribes audio/video files using OpenAI Whisper, with optional speaker diarization (who said what) via pyannote-audio. Outputs plain text, timestamped transcripts, or SRT subtitle files.
- Transcribe any audio or video file (mp3, mp4, wav, m4a, flac, etc.)
- Speaker diarization — identify and label individual speakers
- SRT subtitle output — ready for Premiere Pro, DaVinci Resolve, etc.
- Microphone recording — record directly from your mic and transcribe live
- Five Whisper model sizes to balance speed vs accuracy
- GPU acceleration — NVIDIA (CUDA), AMD/Intel on Windows (DirectML), or CPU fallback
pip install openai-whisper torch pyannote.audio sounddevice soundfile numpypip install openai-whisper torch pyannote.audio sounddevice soundfile numpy torch-directmlThe script auto-detects DirectML when torch-directml is installed. No other changes needed.
Note: Speaker diarization (
--speakers) always runs on CPU when using DirectML — pyannote does not support it.
pip install openai-whisper torch pyannote.audio sounddevice soundfile numpySpeaker labelling requires a free Hugging Face account and model access:
- Create an account at huggingface.co
- Accept the model terms at:
- Get your token at huggingface.co/settings/tokens
- Login via CLI:
Or pass your token directly with
huggingface-cli login
--hf-token YOUR_TOKEN
# Basic transcription
python transcribe.py audio.mp3
# With speaker labels
python transcribe.py meeting.mp4 --speakers
# SRT subtitles with speaker labels (auto-saved as meeting.srt)
python transcribe.py meeting.mp4 --speakers --srt
# Specify exact number of speakers (improves accuracy)
python transcribe.py interview.mp4 --speakers --num-speakers 2 --srt
# Use a larger model for better accuracy
python transcribe.py meeting.mp4 --speakers --model large-v3 --srt
# Save transcript to a file
python transcribe.py audio.mp3 --speakers --timestamps -o transcript.txt
# Record from microphone (30 seconds)
python transcribe.py --record
# Record for 60 seconds with speaker labels
python transcribe.py --record --duration 60 --speakers --srt| Argument | Short | Description |
|---|---|---|
audio |
Path to audio/video file | |
--model |
-m |
Whisper model size (default: base) |
--output |
-o |
Save transcript to file |
--timestamps |
-t |
Include timestamps in output |
--srt |
Output as SRT subtitle file | |
--speakers |
Enable speaker diarization | |
--num-speakers |
Number of speakers (improves accuracy) | |
--hf-token |
Hugging Face token for pyannote | |
--record |
-r |
Record from microphone |
--duration |
-d |
Recording duration in seconds (default: 30) |
--verbose |
-v |
Show detailed progress |
| Model | Parameters | Notes |
|---|---|---|
tiny |
39M | Fastest, least accurate |
base |
74M | Good default |
small |
244M | Noticeably better accuracy |
medium |
769M | High accuracy |
large-v3 |
1550M | Best available (slow on CPU) |
Tip: On CPU, stick to
baseorsmall. Uselarge-v3if you have a GPU.
Docker runs CPU-only (AMD GPU passthrough is not supported on Windows Docker).
# Create a folder for your audio files
mkdir audio
# Copy your files in
cp meeting.mp4 audio/docker compose build
# Basic transcription
docker compose run transcribe /audio/meeting.mp3
# With speaker labels and SRT output
docker compose run transcribe /audio/meeting.mp4 --speakers --srt
# With Hugging Face token for diarization
docker compose run transcribe /audio/meeting.mp4 --speakers --hf-token YOUR_TOKEN --srtOutput files (.srt, .txt) are saved inside /audio/ in the container, which maps back to your local ./audio/ folder.
Note: The
--record(microphone) flag does not work inside Docker.
Plain text:
Hello everyone, welcome to today's meeting. Let's get started.
With timestamps and speakers:
[00:00:01.000 --> 00:00:03.500] [Speaker 1] Hello everyone, welcome to today's meeting.
[00:00:04.000 --> 00:00:06.200] [Speaker 2] Thanks for joining. Let's get started.
SRT format:
1
00:00:01,000 --> 00:00:03,500
[Speaker 1] Hello everyone, welcome
to today's meeting.
2
00:00:04,000 --> 00:00:06,200
[Speaker 2] Thanks for joining.
Let's get started.