Local Offline Text-to-Speech and Voice Cloning

v20260423

local-tts

This skill generates natural-sounding speech from text on the local device using the VoxCPM2 model. It supports 30 languages and three modes: standard text-to-speech, voice design from a text description, and voice cloning from reference audio. After a one-time model download, all processing runs locally (optimized for Apple Silicon), with no API calls, no usage cost, and no text or audio leaving the machine. It suits voiceovers, podcasts, and video narration that must work without an internet connection.

Text-to-Speech Offline Voice-Cloning Voice-Design Audio Narration Apple-Silicon AI

Local TTS — Offline Text-to-Speech

Generate speech from text using VoxCPM2 locally. 30 languages, voice design, voice cloning. Runs on Apple Silicon via Metal. Apache-2.0, zero cost.

Overview

This skill wraps VoxCPM2 (OpenBMB, Apache-2.0) for local text-to-speech. It supports three modes:

Default voice — just feed text, get natural speech in 30 languages (auto-detected)
Voice Design — describe the voice in a parenthetical prefix, get matching speech
Voice Cloning — provide a 3-10s reference clip, the output mimics the voice

All processing happens on-device. No API keys. No network calls after the initial model download. Output is 48 kHz WAV ready for any use (Telegram voice messages, podcasts, video narration).

Prerequisites

Python 3.10+ (3.12 recommended)
macOS with Apple Silicon preferred (M1/M2/M3/M4). Linux with CUDA also works.
~10 GB disk space for model weights (downloaded once on first use)
~16 GB RAM recommended

The skill expects a Python venv at ~/.local-tts/venv with the voxcpm package installed. If missing, create it:

mkdir -p ~/.local-tts
python3.12 -m venv ~/.local-tts/venv
~/.local-tts/venv/bin/pip install --upgrade pip voxcpm

First generation downloads ~10 GB of model weights to ~/.cache/huggingface/. Subsequent runs load the cache in ~30s.

Instructions

Step 1 — Verify the environment

[ -x ~/.local-tts/venv/bin/python ] && ~/.local-tts/venv/bin/python -c "import voxcpm" && echo "venv OK" || echo "Run setup first"

If the venv is missing, guide the user through the setup commands above.
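The same checks can be scripted. The sketch below uses only the Python standard library; `preflight` and its parameters are hypothetical names, not part of the bundled plugin, and the thresholds are taken from the Prerequisites section. The disk check only matters before the weights have been downloaded.

```python
import shutil
import sys
from pathlib import Path


def preflight(venv_python, cache_dir, min_free_gb=10.0, min_python=(3, 10)):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper, not part of the plugin.
    """
    problems = []

    # This checks the interpreter running the helper, not the venv's interpreter.
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info[0]}.{sys.version_info[1]}"
        )

    # The skill expects an interpreter inside the venv.
    venv_python = Path(venv_python).expanduser()
    if not venv_python.is_file():
        problems.append(f"venv interpreter not found: {venv_python}")

    # Measure free space on the volume holding the cache directory.
    # If the directory does not exist yet, walk up to the nearest existing parent.
    probe = Path(cache_dir).expanduser()
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    free_gb = shutil.disk_usage(probe).free / 1e9
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.1f} GB free near {probe}, need about {min_free_gb:g} GB"
        )

    return problems
```

A call such as `preflight("~/.local-tts/venv/bin/python", "~/.cache/huggingface")` returns an empty list when everything is in place.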

Step 2 — Generate the speech

Use the generate.py script bundled in this plugin. The entry point:

VENV=~/.local-tts/venv
SCRIPT=${CLAUDE_PLUGIN_ROOT}/scripts/generate.py
OUT=/tmp/tts_$(date +%s).wav

Default voice (auto-detected language):

"$VENV/bin/python" "$SCRIPT" --text "Your text here." --out "$OUT"

Voice Design — describe the voice in parentheses at the start. The parenthetical is stripped from the spoken audio.

"$VENV/bin/python" "$SCRIPT" \
  --text "(warm female voice, mid-30s, American accent)Welcome back." \
  --out "$OUT"

Description examples that work:

(young woman, gentle and sweet voice)
(older man, deep resonant voice, slow pace)
(cheerful, energetic, fast-talking)
(voix féminine chaleureuse, ton posé) — descriptions in any supported language
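When the description and the spoken text come from separate inputs, the prefix can be assembled programmatically. This is a minimal sketch; `with_voice_design` is a hypothetical helper, and it follows the format shown above, with the parenthetical immediately followed by the text. Rejecting nested parentheses is a conservative assumption, since the source does not say how the parser treats them.

```python
def with_voice_design(description, text):
    """Return "(description)text", the Voice Design input format.

    Hypothetical helper, not part of the plugin.
    """
    # Accept descriptions with or without surrounding parentheses.
    description = description.strip().strip("()").strip()
    if not description:
        # No description given: fall back to the default voice.
        return text
    if "(" in description or ")" in description:
        # Assumption: nested parentheses are not supported, so refuse them.
        raise ValueError("description must not contain parentheses")
    return f"({description}){text}"
```

The result is passed unchanged as the `--text` value.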

Voice Cloning — provide a reference clip (3-10s). Clones timbre, accent, emotional tone.

"$VENV/bin/python" "$SCRIPT" \
  --text "This is the cloned voice speaking." \
  --ref /path/to/reference.wav \
  --out "$OUT"
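Checking the clip length before cloning avoids a wasted run. The sketch below reads the duration with Python's standard `wave` module, so it only handles uncompressed PCM WAV files. Other formats such as MP3 would need a different reader. The function names are hypothetical, and the 3 to 10 second bounds come from the text above.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds (frames divided by sample rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_clip_ok(path, min_seconds=3.0, max_seconds=10.0):
    """True if the clip length falls inside the recommended 3-10 s window."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```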

Ultimate Cloning — reference + prompt for maximum fidelity (reproduces micro-level vocal nuances):

"$VENV/bin/python" "$SCRIPT" \
  --text "Highest fidelity clone." \
  --ref /path/to/ref.wav \
  --prompt-wav /path/to/ref.wav \
  --out "$OUT"

Long text via stdin (for articles, scripts):

"$VENV/bin/python" "$SCRIPT" --stdin --out "$OUT" < /path/to/article.txt
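When the text is already held in a Python string, it can be fed to the script's stdin without a temporary file. This is a sketch using `subprocess.run`; `synthesize_from_stdin` is a hypothetical wrapper, and it assumes the script reads stdin as UTF-8 and reports its status on stdout.

```python
import subprocess


def synthesize_from_stdin(text, out_path, python_bin, script_path):
    """Run `<python_bin> <script_path> --stdin --out <out_path>` with text on stdin.

    Hypothetical wrapper, not part of the plugin. Returns the script's stdout.
    """
    result = subprocess.run(
        [str(python_bin), str(script_path), "--stdin", "--out", str(out_path)],
        input=text,            # sent to the child process on stdin
        encoding="utf-8",      # assumption: the script expects UTF-8 input
        capture_output=True,
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()
```

With `check=True`, a failed run raises an exception instead of returning silently, and the captured stderr is available on the exception object.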

Step 3 — Verify and hand off

file "$OUT"   # Should show: "RIFF ... WAVE audio, Microsoft PCM, 16 bit, mono 48000 Hz"
ls -lh "$OUT" # Check size is reasonable

The script prints OK <duration>s <rtf>x <path> on success.
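A caller that captures the script's output can parse that status line. The sketch below assumes the fields are plain decimal numbers separated by whitespace. The exact number formatting is not specified in the source, so the pattern and the names `GenerationStatus` and `parse_status` are illustrative.

```python
import re
from typing import NamedTuple, Optional


class GenerationStatus(NamedTuple):
    duration_s: float  # length of the generated audio
    rtf: float         # real-time factor reported by the script
    path: str          # output file path


# Assumed layout: "OK <duration>s <rtf>x <path>", e.g. "OK 10.0s 2.30x /tmp/out.wav".
_STATUS_RE = re.compile(r"^OK\s+(\d+(?:\.\d+)?)s\s+(\d+(?:\.\d+)?)x\s+(.+)$")


def parse_status(line: str) -> Optional[GenerationStatus]:
    """Parse a success line; return None if the line does not match."""
    match = _STATUS_RE.match(line.strip())
    if match is None:
        return None
    return GenerationStatus(float(match.group(1)), float(match.group(2)), match.group(3))
```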

Output

Format: 48 kHz mono WAV, 16-bit PCM
Location: whatever --out path specified (typically /tmp/tts_*.wav)
Size: about 96 KB per second of audio (48,000 samples/s × 2 bytes, mono)
Usage: ready to attach to Telegram, embed in video, use as voiceover
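The format and size figures above can be checked with Python's standard `wave` module. This is a sketch with hypothetical function names; the expected values are the ones listed in this section.

```python
import wave

# Documented output format: 48 kHz, mono, 16-bit PCM.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 48000}


def wav_properties(path):
    """Read channel count, sample width, sample rate and duration from a WAV file."""
    with wave.open(str(path), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_documented_format(path):
    """True if the file is 48 kHz mono 16-bit PCM."""
    props = wav_properties(path)
    return all(props[key] == value for key, value in EXPECTED.items())


def pcm_bytes(duration_s, sample_rate_hz=48000, sample_width_bytes=2, channels=1):
    """Size of the raw audio payload in bytes, excluding the WAV header."""
    return int(round(duration_s * sample_rate_hz)) * sample_width_bytes * channels
```

For example, `pcm_bytes(60)` gives 5,760,000 bytes, so a one-minute clip is a little under 6 MB.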

Script options

Flag	Purpose
`--text STR`	Text to synthesize
`--stdin`	Read text from stdin (for long input)
`--out PATH`	Output WAV path (required)
`--ref PATH`	Reference audio for cloning
`--prompt-wav PATH`	Prompt wav for ultimate cloning
`--cfg FLOAT`	Classifier-free guidance (default 2.0)
`--steps INT`	Diffusion steps (default 10)
`--model ID`	Model id (default `openbmb/VoxCPM2`)
`--quiet`	Suppress loading messages
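For callers that build the command from Python, the flags in the table map onto an argument list as sketched below. `build_generate_argv` is a hypothetical helper that uses only the flags listed above. Treating `--text` and `--stdin` as mutually exclusive is an assumption, and optional flags are emitted only when set so that the script's own defaults apply.

```python
def build_generate_argv(python_bin, script_path, out_path, text=None, use_stdin=False,
                        ref=None, prompt_wav=None, cfg=None, steps=None,
                        model=None, quiet=False):
    """Build the argv list for generate.py from the documented flags.

    Hypothetical helper, not part of the plugin.
    """
    # Assumption: exactly one text source, either --text or --stdin.
    if (text is not None) == use_stdin:
        raise ValueError("pass either text or use_stdin=True, but not both")

    argv = [str(python_bin), str(script_path)]
    argv += ["--stdin"] if use_stdin else ["--text", text]
    argv += ["--out", str(out_path)]

    # Optional flags, in the order of the table above.
    if ref is not None:
        argv += ["--ref", str(ref)]
    if prompt_wav is not None:
        argv += ["--prompt-wav", str(prompt_wav)]
    if cfg is not None:
        argv += ["--cfg", str(cfg)]
    if steps is not None:
        argv += ["--steps", str(steps)]
    if model is not None:
        argv += ["--model", model]
    if quiet:
        argv.append("--quiet")
    return argv
```

Passing the resulting list to `subprocess.run` avoids shell quoting problems with text that contains quotes or parentheses.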

Supported languages (30)

Arabic, Burmese, Chinese (+ dialects), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese.

No language tag needed — VoxCPM2 auto-detects the language from the text.

Error Handling

ModuleNotFoundError: No module named 'voxcpm' — the venv is missing or the voxcpm package is not installed in it. Run the setup commands from Prerequisites.
No such file: VoxCPM2 weights — HuggingFace cache missing. First run will download (needs network, ~10 GB).
Slow first call (~5 min) — normal. Model download + initial load. Subsequent runs ~30s.
French pronunciation edge cases — add an IPA-ish hint or rephrase. Most names and proper nouns work out of the box.

Performance

On Apple M4 with MPS + bfloat16:

First load: ~340s (downloads weights)
Subsequent loads: ~30s
Generation: real-time factor (RTF) of about 2.3, i.e. roughly 2.3× slower than real time (10s audio ≈ 23s compute)

Not suitable for real-time streaming. Good for batch generation, voiceovers, podcasts, voice messages.
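For batch planning, the figures above give a rough wall-clock estimate. The sketch below is plain arithmetic with the M4 numbers as defaults; `estimate_wall_seconds` is a hypothetical helper, and real timings will vary with hardware and text.

```python
def estimate_wall_seconds(audio_seconds, rtf=2.3, load_seconds=30.0):
    """Rough wall-clock time: one model load plus audio length times RTF.

    Defaults are the Apple M4 figures quoted above (cached weights).
    """
    return load_seconds + audio_seconds * rtf
```

With these defaults, a 10 s clip takes about 53 s end to end (30 s load plus 23 s compute), and a 60 s clip about 168 s.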

Examples

Example 1: Voice message for Telegram

"$VENV/bin/python" "$SCRIPT" \
  --text "Hey, quick voice note about our meeting tomorrow." \
  --out /tmp/voice_msg.wav

Example 2: Clone a voice from an MP3

"$VENV/bin/python" "$SCRIPT" \
  --text "Bonjour, c'est une voix clonée localement." \
  --ref ~/my_voice_sample.mp3 \
  --out /tmp/cloned.wav

Example 3: Designed voice for narration

"$VENV/bin/python" "$SCRIPT" \
  --text "(deep narrator voice, dramatic, slow pace)In a world where AI runs locally..." \
  --out /tmp/narration.wav

Resources

VoxCPM2 source: https://github.com/OpenBMB/VoxCPM
Model card: https://huggingface.co/openbmb/VoxCPM2
Script source (same as bundled): https://github.com/vdk888/local-tts/blob/main/scripts/generate.py
License: Apache-2.0

Info

Category Productivity

Name local-tts

Version v20260423

Size 6.64KB

Source jeremylongshore/claude-code-plugins-plus-skills

Updated At 2026-04-28