scribbletune/voice

Voice Cloning Voiceover Tools

Generate voiceovers in your own voice, locally, using Chatterbox TTS by Resemble AI. No subscriptions, no cloud, no usage limits. Everything runs on your machine.


How it works

Chatterbox is an open source neural TTS model. You give it a reference clip of your voice and a script, and it synthesizes speech in your voice. It's not an LLM — it's a diffusion-based audio model that captures your vocal identity (accent, cadence, intonation) and applies it to new text.


Requirements

1. Install uv

uv is a fast Python package manager. It handles dependencies automatically when you run the scripts — no separate pip install step needed.

Windows (PowerShell):

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

macOS:

curl -LsSf https://astral.sh/uv/install.sh | sh

Or via Homebrew:

brew install uv
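The "no separate pip install step" works because uv can read a script's dependencies from a PEP 723 inline metadata header at the top of the file. A sketch of what such a header might look like — the exact dependency names below are assumptions, not copied from these scripts:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "chatterbox-tts",   # assumed package name for the Chatterbox TTS model
#     "torchaudio",       # assumed, for reading and writing WAV files
# ]
# ///
```

When you run `uv run voiceover.py`, uv resolves these packages into an isolated, cached environment before executing the script.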

2. Install ffmpeg

Required only if your reference audio is in M4A or MP3 format. Skip this step if your reference audio is already a WAV file or you plan to record straight to WAV.

Windows:

winget install ffmpeg

Or via Chocolatey:

choco install ffmpeg

macOS:

brew install ffmpeg

3. Make the scripts executable (macOS only)

chmod +x voiceover.py
chmod +x voiceover_server.py

Windows does not use file permissions this way — skip this step.


First run

The first time you run either script, Chatterbox's model weights (~3 GB total) are downloaded automatically from HuggingFace. This is a one-time download; subsequent runs load from the local cache:

  • Windows: %USERPROFILE%\.cache\huggingface\hub\
  • macOS: ~/.cache/huggingface/hub/

Your reference audio

  • WAV format preferred. M4A and MP3 also work (ffmpeg handles conversion automatically).
  • Aim for at least 30–60 seconds of clean, natural speech.
  • Record in a quiet space — no background noise, no music.
  • Reading out Harvard sentences makes an excellent reference clip.
  • Record your voice and save it as sample-voice.wav in this same directory.
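The automatic M4A/MP3 conversion mentioned above can be sketched roughly as follows. The 24 kHz mono settings are assumptions for illustration; the actual scripts may use different parameters:

```python
import subprocess
from pathlib import Path

def ensure_wav(ref: str) -> str:
    """Pass WAV files through unchanged; convert anything else via ffmpeg."""
    path = Path(ref)
    if path.suffix.lower() == ".wav":
        return ref
    out = path.with_suffix(".wav")
    # -y: overwrite existing output, -ar/-ac: resample to 24 kHz mono (assumed)
    cmd = ["ffmpeg", "-y", "-i", str(path), "-ar", "24000", "-ac", "1", str(out)]
    subprocess.run(cmd, check=True)
    return str(out)
```

This is why ffmpeg is only required for non-WAV input: the WAV branch never touches it.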

Scripts

voiceover.py — single generation

Good for one-off generations. Loads the model fresh each run.

Windows:

# Basic usage
uv run voiceover.py --ref sample-voice.wav --text "Welcome to this video"

# From a text file
uv run voiceover.py --ref sample-voice.wav --file script.txt

# Custom output filename
uv run voiceover.py --ref sample-voice.wav --text "Your script" --out intro.wav

# Tweak voice parameters
uv run voiceover.py --ref sample-voice.wav --text "Your script" --exaggeration 0.4 --cfg 0.7

macOS:

# Basic usage
./voiceover.py --ref sample-voice.wav --text "Welcome to this video"

# From a text file
./voiceover.py --ref sample-voice.wav --file script.txt

# Custom output filename
./voiceover.py --ref sample-voice.wav --text "Your script" --out intro.wav

# Tweak voice parameters
./voiceover.py --ref sample-voice.wav --text "Your script" --exaggeration 0.4 --cfg 0.7

voiceover_server.py — persistent server (recommended for sessions)

Loads the model once and keeps it warm. Use this when generating multiple voiceovers in one sitting — each generation is much faster since the model stays in memory.
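The server's request loop can be sketched with Python's standard library. This is a minimal illustration of the shape of the protocol (POST body in, acknowledgement out), not the actual server code — a real handler would pass the script to the warm TTS model and write a WAV file:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class VoiceoverHandler(BaseHTTPRequestHandler):
    """Minimal sketch: read the POSTed script, reply with plain text."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        script = self.rfile.read(length).decode("utf-8")
        # A real server would call the loaded TTS model here.
        reply = f"received {len(script)} chars"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(reply.encode("utf-8"))
```

Because the model object lives in the server process, each request skips the slow model-loading step that voiceover.py pays on every run.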

Terminal 1 — start the server:

Windows:

uv run voiceover_server.py --ref sample-voice.wav

macOS:

./voiceover_server.py --ref sample-voice.wav

Terminal 2 — generate as many times as you want:

# Basic
curl -X POST http://localhost:8765 -d "Welcome to this video"

# Custom output filename
curl -X POST "http://localhost:8765?out=intro.wav" -d "Your full script here..."

# Tweak parameters per request
curl -X POST "http://localhost:8765?exaggeration=0.4&cfg=0.7" -d "Your script"

curl is available natively on both macOS and Windows 10/11.

Output files are saved in your current directory and auto-named voiceover_001.wav, voiceover_002.wav, etc., unless you specify ?out=filename.wav.
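The auto-numbering behaves roughly like this sketch (a guess at the logic, not the server's actual code):

```python
from pathlib import Path

def next_output_name(directory: str = ".", prefix: str = "voiceover_") -> str:
    """Return the next free voiceover_NNN.wav name in the given directory."""
    existing = {p.name for p in Path(directory).glob(f"{prefix}[0-9][0-9][0-9].wav")}
    n = 1
    while f"{prefix}{n:03d}.wav" in existing:
        n += 1
    return f"{prefix}{n:03d}.wav"
```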

Long scripts are automatically split into sentence-sized chunks, generated separately, and stitched into one seamless output file.
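Sentence-level chunking like this can be sketched with a regex split on sentence-ending punctuation. The 300-character budget is an assumption for illustration; the real scripts may chunk differently:

```python
import re

def chunk_script(text: str, max_chars: int = 300) -> list[str]:
    """Split a long script at sentence boundaries into chunks under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is synthesized on its own, then the audio segments are concatenated into one file.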

Stop the server: Ctrl+C in Terminal 1.


Parameters

Parameter       Default  Description
--exaggeration  0.5      Expressiveness. Lower = calmer, higher = more animated. Try 0.3–0.7.
--cfg           0.5      How closely the output follows your reference voice. Higher = more like you. Try 0.6–0.7 if it sounds too generic.

Both parameters can also be passed per-request to the server via query string: ?exaggeration=0.4&cfg=0.7
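Parsing those per-request overrides out of the query string is straightforward with the standard library. A sketch, assuming the server falls back to the 0.5 defaults when a parameter is omitted:

```python
from urllib.parse import urlparse, parse_qs

def parse_request_params(path: str) -> dict:
    """Extract per-request overrides from the request path's query string."""
    qs = parse_qs(urlparse(path).query)
    params = {}
    for key, default in (("exaggeration", 0.5), ("cfg", 0.5)):
        params[key] = float(qs[key][0]) if key in qs else default
    # Optional output filename; None means auto-numbered output.
    params["out"] = qs["out"][0] if "out" in qs else None
    return params
```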


Convert output to MP3 or MP4

These commands work on both macOS and Windows (requires ffmpeg installed):

# WAV → MP3
ffmpeg -i voiceover.wav voiceover.mp3

# WAV → MP4 (black screen video, useful for LinkedIn etc.)
ffmpeg -i voiceover.wav -f lavfi -i color=c=black:s=1280x720:r=24 -shortest -c:v libx264 -c:a aac voiceover.mp4

Notes

  • Device (Windows): If you have an NVIDIA GPU, PyTorch may use CUDA automatically for faster generation. CPU fallback works fine otherwise.
  • Device (macOS): Chatterbox runs on CPU on Apple Silicon. MPS (Apple's GPU backend) is not used due to a PyTorch conv1d limitation at this model's output size. CPU on an M-series Mac is fast enough for this use case.
  • Watermark: Generated audio includes an imperceptible neural watermark from Resemble AI (Perth watermarker). It does not affect audio quality and is not detected or flagged by YouTube or other platforms.
  • Model cache: Safe to delete to free disk space — it will re-download on next run. Windows: %USERPROFILE%\.cache\huggingface\hub\. macOS: ~/.cache/huggingface/hub/.
  • Python version: Pinned to 3.11 in the script header. uv handles this automatically.
