Generate speech from text using VoxCPM2 locally. 30 languages, voice design, voice cloning. Runs on Apple Silicon via Metal. Apache-2.0, zero cost.
This skill wraps VoxCPM2 (OpenBMB, Apache-2.0) for local text-to-speech. It supports three modes: voice design, voice cloning, and ultimate cloning.
All processing happens on-device. No API keys. No network calls after the initial model download. Output is 48 kHz WAV ready for any use (Telegram voice messages, podcasts, video narration).
The skill expects a Python venv at ~/.local-tts/venv with the voxcpm package installed. If missing, create it:
mkdir -p ~/.local-tts
python3.12 -m venv ~/.local-tts/venv
~/.local-tts/venv/bin/pip install --upgrade pip voxcpm
First generation downloads ~10 GB of model weights to ~/.cache/huggingface/. Subsequent runs load the cache in ~30s.
[ -x ~/.local-tts/venv/bin/python ] && echo "venv OK" || echo "Run setup first"
If the venv is missing, guide the user through the setup commands above.
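The same check can be done from Python before invoking the script. This is a minimal sketch that assumes the venv layout described above; the helper name `find_venv_python` is hypothetical and not part of the plugin.

```python
from pathlib import Path
from typing import Optional


def find_venv_python(venv_dir: Path) -> Optional[Path]:
    """Return the venv's interpreter path, or None if the venv is missing.

    Hypothetical helper: it only checks that bin/python exists, mirroring
    the shell check above. It does not verify that voxcpm is installed.
    """
    candidate = venv_dir / "bin" / "python"
    return candidate if candidate.exists() else None


if __name__ == "__main__":
    python = find_venv_python(Path.home() / ".local-tts" / "venv")
    print("venv OK" if python else "Run setup first")
```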
Use the generate.py script bundled in this plugin. The entry point:
VENV=~/.local-tts/venv
SCRIPT=${CLAUDE_PLUGIN_ROOT}/scripts/generate.py
OUT=/tmp/tts_$(date +%s).wav
Default voice (auto-detected language):
"$VENV/bin/python" "$SCRIPT" --text "Your text here." --out "$OUT"
Voice Design — describe the voice in parentheses at the start. The parenthetical is stripped from the spoken audio.
"$VENV/bin/python" "$SCRIPT" \
--text "(warm female voice, mid-30s, American accent)Welcome back." \
--out "$OUT"
Description examples that work:
(young woman, gentle and sweet voice)
(older man, deep resonant voice, slow pace)
(cheerful, energetic, fast-talking)
(voix féminine chaleureuse, ton posé): descriptions can be written in any supported language
Voice Cloning: provide a reference clip (3-10 s). This clones timbre, accent, and emotional tone.
"$VENV/bin/python" "$SCRIPT" \
--text "This is the cloned voice speaking." \
--ref /path/to/reference.wav \
--out "$OUT"
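Because the reference clip is expected to be 3-10 s long, it can be worth checking its duration before cloning. Below is a minimal sketch using Python's standard `wave` module, which reads PCM WAV files only (an MP3 reference would need a different reader). `reference_duration_ok` is a hypothetical helper, and treating the 3-10 s range as a hard limit is an assumption, not something the script is known to enforce.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_duration_ok(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """True if the clip length falls inside the assumed 3-10 s window."""
    return min_s <= wav_duration_seconds(path) <= max_s
```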
Ultimate Cloning — reference + prompt for maximum fidelity (reproduces micro-level vocal nuances):
"$VENV/bin/python" "$SCRIPT" \
--text "Highest fidelity clone." \
--ref /path/to/ref.wav \
--prompt-wav /path/to/ref.wav \
--out "$OUT"
Long text via stdin (for articles, scripts):
"$VENV/bin/python" "$SCRIPT" --stdin --out "$OUT" < /path/to/article.txt
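When the script is driven from Python instead of a shell, the text can be passed on stdin through `subprocess.run`. This is a minimal sketch; `run_with_stdin` is a hypothetical helper, and it assumes the script reads UTF-8 text from stdin.

```python
import subprocess
from typing import List


def run_with_stdin(argv: List[str], text: str) -> subprocess.CompletedProcess:
    """Run argv, feed `text` on stdin, and capture stdout/stderr as text.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    return subprocess.run(
        argv,
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )


# Intended usage (paths are placeholders):
# run_with_stdin([venv_python, script, "--stdin", "--out", out_path], article_text)
```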
file "$OUT" # Should show: "RIFF ... WAVE audio, Microsoft PCM, 16 bit, mono 48000 Hz"
ls -lh "$OUT" # Check size is reasonable
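The format check can also be done programmatically. Here is a minimal sketch using the standard `wave` module, assuming the output is 16-bit PCM as the `file` output above indicates; `wav_format` and `is_expected_output` are hypothetical helper names.

```python
import wave
from typing import Tuple


def wav_format(path: str) -> Tuple[int, int, int]:
    """Return (channels, bits_per_sample, sample_rate_hz) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def is_expected_output(path: str) -> bool:
    """True if the file is mono, 16-bit, 48 kHz."""
    return wav_format(path) == (1, 16, 48000)
```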
The script prints OK <duration>s <rtf>x <path> on success.
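A caller that needs the duration, real-time factor, or path can parse that status line. The sketch below assumes the fields are separated by whitespace and that the numbers are plain decimals; the exact formatting is not specified here, so the regular expression and the `parse_status` helper are illustrative only.

```python
import re
from typing import NamedTuple, Optional


class Result(NamedTuple):
    duration_s: float
    rtf: float
    path: str


# Assumed shape: "OK <duration>s <rtf>x <path>", e.g. "OK 3.20s 1.50x /tmp/a.wav"
_STATUS = re.compile(
    r"^OK\s+(?P<duration>\d+(?:\.\d+)?)s\s+(?P<rtf>\d+(?:\.\d+)?)x\s+(?P<path>.+)$"
)


def parse_status(line: str) -> Optional[Result]:
    """Parse a success line; return None if the line does not match."""
    match = _STATUS.match(line.strip())
    if match is None:
        return None
    return Result(float(match["duration"]), float(match["rtf"]), match["path"])
```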
The WAV file is written to the --out path specified (typically /tmp/tts_*.wav).

| Flag | Purpose |
|---|---|
| --text STR | Text to synthesize |
| --stdin | Read text from stdin (for long input) |
| --out PATH | Output WAV path (required) |
| --ref PATH | Reference audio for cloning |
| --prompt-wav PATH | Prompt wav for ultimate cloning |
| --cfg FLOAT | Classifier-free guidance (default 2.0) |
| --steps INT | Diffusion steps (default 10) |
| --model ID | Model id (default openbmb/VoxCPM2) |
| --quiet | Suppress loading messages |
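The flags above map naturally onto an argument list for `subprocess`. Below is a minimal sketch of a hypothetical `build_command` helper that uses only the flags listed in the table. It assumes --text and --stdin are alternatives, which the table suggests but does not state.

```python
from typing import List, Optional


def build_command(
    python: str,
    script: str,
    out: str,
    text: Optional[str] = None,
    ref: Optional[str] = None,
    prompt_wav: Optional[str] = None,
    cfg: Optional[float] = None,
    steps: Optional[int] = None,
    quiet: bool = False,
) -> List[str]:
    """Assemble an argv list for generate.py.

    When `text` is None the command uses --stdin instead of --text
    (assumed to be mutually exclusive). Flags left as None are omitted
    so the script's own defaults apply.
    """
    argv = [python, script, "--out", out]
    argv += ["--text", text] if text is not None else ["--stdin"]
    if ref is not None:
        argv += ["--ref", ref]
    if prompt_wav is not None:
        argv += ["--prompt-wav", prompt_wav]
    if cfg is not None:
        argv += ["--cfg", str(cfg)]
    if steps is not None:
        argv += ["--steps", str(steps)]
    if quiet:
        argv.append("--quiet")
    return argv
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems when the text contains parentheses or apostrophes, as voice descriptions often do.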
Arabic, Burmese, Chinese (+ dialects), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese.
No language tag is needed: VoxCPM2 auto-detects the language from the text.
ModuleNotFoundError: voxcpm means the venv is missing. Run the setup commands from Prerequisites.
No such file: VoxCPM2 weights means the Hugging Face cache is missing. The first run will download the weights (needs network, ~10 GB).
On Apple M4 with MPS + bfloat16:
Not suitable for real-time streaming. Good for batch generation, voiceovers, podcasts, voice messages.
Example 1: Voice message for Telegram
"$VENV/bin/python" "$SCRIPT" \
--text "Hey, quick voice note about our meeting tomorrow." \
--out /tmp/voice_msg.wav
Example 2: Clone a voice from an MP3
"$VENV/bin/python" "$SCRIPT" \
--text "Bonjour, c'est une voix clonée localement." \
--ref ~/my_voice_sample.mp3 \
--out /tmp/cloned.wav
Example 3: Designed voice for narration
"$VENV/bin/python" "$SCRIPT" \
--text "(deep narrator voice, dramatic, slow pace)In a world where AI runs locally..." \
--out /tmp/narration.wav