Generate voiceovers in your own voice, locally, using Chatterbox TTS by Resemble AI. No subscriptions, no cloud, no usage limits. Everything runs on your machine.
Chatterbox is an open-source neural TTS model. You give it a reference clip of your voice and a script, and it synthesizes speech in your voice. It doesn't generate text like an LLM; it's a generative audio model that captures your vocal identity (accent, cadence, intonation) and applies it to new text.
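Under the hood, the scripts in this repo presumably drive Resemble AI's `chatterbox-tts` Python package. A minimal sketch of a single cloning call, following that package's documented API (names per its README, not copied from this repo's code):

```python
# Sketch of one voice-cloning call via the chatterbox-tts package.
# Names follow the package README; not copied from voiceover.py.
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cpu")  # or "cuda" with an NVIDIA GPU
wav = model.generate(
    "Welcome to this video",
    audio_prompt_path="sample-voice.wav",  # reference clip of your voice
    exaggeration=0.5,                      # expressiveness (see parameter table below)
    cfg_weight=0.5,                        # adherence to the reference voice
)
torchaudio.save("voiceover.wav", wav, model.sr)
```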
uv is a fast Python package manager. It handles dependencies automatically when you run the scripts — no separate pip install step needed.
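This likely works via PEP 723 inline script metadata: a short header at the top of each script declares its dependencies, and `uv run` resolves and installs them in an isolated environment on first run. An illustrative header (not copied from the repo) looks like:

```python
# /// script
# requires-python = "==3.11.*"
# dependencies = ["chatterbox-tts", "torchaudio"]
# ///
```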
Windows (PowerShell):
```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

macOS:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Or via Homebrew:

```bash
brew install uv
```

ffmpeg is required only if your reference audio is in M4A or MP3 format. Skip it if you already have a WAV file or plan to use WAV only.
Windows:

```powershell
winget install ffmpeg
```

Or via Chocolatey:

```powershell
choco install ffmpeg
```

macOS:

```bash
brew install ffmpeg
```

On macOS, make the scripts executable:

```bash
chmod +x voiceover.py
chmod +x voiceover_server.py
```

Windows does not use file permissions this way, so skip this step there.
The first time you run either script, Chatterbox's model weights (~3GB total) will be downloaded automatically from HuggingFace. This is a one-time download — subsequent runs load from disk instantly.
- Windows: `%USERPROFILE%\.cache\huggingface\hub\`
- macOS: `~/.cache/huggingface/hub/`
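If you'd rather fetch the weights ahead of time (for example, before going offline), you can pre-populate the same cache with `huggingface_hub`. The repo id below is an assumption; verify it against the Chatterbox model card:

```python
# Pre-download the model weights into the HuggingFace cache.
# Repo id assumed, not confirmed by this repo; check the model card.
from huggingface_hub import snapshot_download

snapshot_download("ResembleAI/chatterbox")
```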
- WAV format preferred. M4A and MP3 also work (ffmpeg handles conversion automatically).
- Aim for at least 30–60 seconds of clean, natural speech.
- Record in a quiet space — no background noise, no music.
- Reading Harvard sentences aloud makes an excellent reference clip.
- Record your voice and save it as `sample-voice.wav` in this same directory.
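A quick way to confirm your clip meets the 30–60 second guideline, using `torchaudio` (which these scripts almost certainly pull in anyway):

```python
# Sanity-check the reference clip: duration and sample rate.
import torchaudio

info = torchaudio.info("sample-voice.wav")
seconds = info.num_frames / info.sample_rate
print(f"{seconds:.1f} s at {info.sample_rate} Hz")  # aim for 30-60 s of clean speech
```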
The one-shot script (`voiceover.py`) is good for one-off generations. It loads the model fresh each run.
Windows:

```powershell
# Basic usage
uv run voiceover.py --ref sample-voice.wav --text "Welcome to this video"

# From a text file
uv run voiceover.py --ref sample-voice.wav --file script.txt

# Custom output filename
uv run voiceover.py --ref sample-voice.wav --text "Your script" --out intro.wav

# Tweak voice parameters
uv run voiceover.py --ref sample-voice.wav --text "Your script" --exaggeration 0.4 --cfg 0.7
```

macOS:

```bash
# Basic usage
./voiceover.py --ref sample-voice.wav --text "Welcome to this video"

# From a text file
./voiceover.py --ref sample-voice.wav --file script.txt

# Custom output filename
./voiceover.py --ref sample-voice.wav --text "Your script" --out intro.wav

# Tweak voice parameters
./voiceover.py --ref sample-voice.wav --text "Your script" --exaggeration 0.4 --cfg 0.7
```

The server script (`voiceover_server.py`) loads the model once and keeps it warm. Use this when generating multiple voiceovers in one sitting; each generation is much faster since the model stays in memory.
Terminal 1 — start the server:
Windows:

```powershell
uv run voiceover_server.py --ref sample-voice.wav
```

macOS:

```bash
./voiceover_server.py --ref sample-voice.wav
```

Terminal 2 — generate as many times as you want:
```bash
# Basic
curl -X POST http://localhost:8765 -d "Welcome to this video"

# Custom output filename
curl -X POST "http://localhost:8765?out=intro.wav" -d "Your full script here..."

# Tweak parameters per request
curl -X POST "http://localhost:8765?exaggeration=0.4&cfg=0.7" -d "Your script"
```

curl is available natively on both macOS and Windows 10/11. (In Windows PowerShell, `curl` is an alias for `Invoke-WebRequest`, so call `curl.exe` explicitly.)
Output files are saved in your current directory, auto-named `voiceover_001.wav`, `voiceover_002.wav`, and so on, unless you specify `?out=filename.wav`.
Long scripts are automatically split into sentence-sized chunks, generated separately, and stitched into one seamless output file.
Stop the server: Ctrl+C in Terminal 1.
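For the curious, the warm-server pattern plus the chunk-and-stitch behavior described above boils down to something like the sketch below. This is hypothetical; the real voiceover_server.py will differ in its handler details, chunking logic, and output naming:

```python
# Minimal sketch of the warm-model server: load the model once, reuse it
# per request, split long scripts into sentences, stitch the audio together.
import re
import torch
import torchaudio
from http.server import BaseHTTPRequestHandler, HTTPServer
from chatterbox.tts import ChatterboxTTS

MODEL = ChatterboxTTS.from_pretrained(device="cpu")  # loaded once, stays warm
REF = "sample-voice.wav"

def synthesize(text: str) -> torch.Tensor:
    chunks = re.split(r"(?<=[.!?])\s+", text.strip())   # sentence-sized pieces
    waves = [MODEL.generate(c, audio_prompt_path=REF) for c in chunks if c]
    return torch.cat(waves, dim=-1)                     # stitch into one take

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        script = self.rfile.read(length).decode("utf-8")
        torchaudio.save("voiceover_001.wav", synthesize(script), MODEL.sr)
        self.send_response(200)
        self.end_headers()

HTTPServer(("127.0.0.1", 8765), Handler).serve_forever()
```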
| Parameter | Default | Description |
|---|---|---|
| `--exaggeration` | 0.5 | Expressiveness. Lower = calmer, higher = more animated. Try 0.3–0.7. |
| `--cfg` | 0.5 | How closely the output follows your reference voice. Higher = more like you. Try 0.6–0.7 if it sounds too generic. |
Both parameters can also be passed per request to the server via query string: `?exaggeration=0.4&cfg=0.7`
These commands work on both macOS and Windows (ffmpeg must be installed):
```bash
# WAV → MP3
ffmpeg -i voiceover.wav voiceover.mp3

# WAV → MP4 (black-screen video, useful for LinkedIn etc.)
ffmpeg -i voiceover.wav -f lavfi -i color=c=black:s=1280x720:r=24 -shortest -c:v libx264 -c:a aac voiceover.mp4
```

- Device (Windows): If you have an NVIDIA GPU, PyTorch may use CUDA automatically for faster generation. CPU fallback works fine otherwise.
- Device (macOS): Chatterbox runs on CPU on Apple Silicon. MPS (Apple's GPU backend) is not used due to a PyTorch conv1d limitation at this model's output size. CPU on an M-series Mac is fast enough for this use case. (A sketch of the device-selection logic follows these notes.)
- Watermark: Generated audio includes an imperceptible neural watermark from Resemble AI (Perth watermarker). It does not affect audio quality and is not detected or flagged by YouTube or other platforms.
- Model cache: Safe to delete to free disk space; it will re-download on the next run. Windows: `%USERPROFILE%\.cache\huggingface\hub\`. macOS: `~/.cache/huggingface/hub/`.
- Python version: Pinned to 3.11 in the script header. `uv` handles this automatically.
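On the device notes above, the selection logic likely amounts to the helper below (hypothetical, not taken from the scripts): prefer CUDA when an NVIDIA GPU is present, otherwise fall back to CPU, deliberately skipping MPS.

```python
# Hypothetical device selection consistent with the notes above.
import torch

def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU path on Windows/Linux
    return "cpu"       # MPS skipped due to the conv1d limitation
```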