本地离线文本转语音与声音克隆

v20260423

local-tts

本技能基于VoxCPM2模型实现本地离线文本转语音合成。支持30种语言，提供三种先进模式：标准语音合成、描述式声音设计，以及从用户提供的参考音频进行声音克隆。所有处理均在本地设备上完成，无需任何API调用或网络连接，非常适合制作高质量的视频旁白、播客音频或语音消息。

文本转语音离线声音克隆声音设计音频旁白苹果芯片人工智能

获取技能

405 次下载

概览

Local TTS — Offline Text-to-Speech

Generate speech from text using VoxCPM2 locally. 30 languages, voice design, voice cloning. Runs on Apple Silicon via Metal. Apache-2.0, zero cost.

Overview

This skill wraps VoxCPM2 (OpenBMB, Apache-2.0) for local text-to-speech. It supports three modes:

Default voice — just feed text, get natural speech in 30 languages (auto-detected)
Voice Design — describe the voice in a parenthetical prefix, get matching speech
Voice Cloning — provide a 3-10s reference clip, the output mimics the voice

All processing happens on-device. No API keys. No network calls after the initial model download. Output is 48 kHz WAV ready for any use (Telegram voice messages, podcasts, video narration).

Prerequisites

Python 3.10+ (3.12 recommended)
macOS with Apple Silicon preferred (M1/M2/M3/M4). Linux with CUDA also works.
~10 GB disk space for model weights (downloaded once on first use)
~16 GB RAM recommended

The skill expects a Python venv at ~/.local-tts/venv with the voxcpm package installed. If missing, create it:

mkdir -p ~/.local-tts
python3.12 -m venv ~/.local-tts/venv
~/.local-tts/venv/bin/pip install --upgrade pip voxcpm

First generation downloads ~10 GB of model weights to ~/.cache/huggingface/. Subsequent runs load the cache in ~30s.

Instructions

Step 1 — Verify the environment

ls ~/.local-tts/venv/bin/python && echo "venv OK" || echo "Run setup first"

If the venv is missing, guide the user through the setup commands above.

Step 2 — Generate the speech

Use the generate.py script bundled in this plugin. The entry point:

VENV=~/.local-tts/venv
SCRIPT=${CLAUDE_PLUGIN_ROOT}/scripts/generate.py
OUT=/tmp/tts_$(date +%s).wav

Default voice (auto-detected language):

"$VENV/bin/python" "$SCRIPT" --text "Your text here." --out "$OUT"

Voice Design — describe the voice in parentheses at the start. The parenthetical is stripped from the spoken audio.

"$VENV/bin/python" "$SCRIPT" \
  --text "(warm female voice, mid-30s, American accent)Welcome back." \
  --out "$OUT"

Description examples that work:

(young woman, gentle and sweet voice)
(older man, deep resonant voice, slow pace)
(cheerful, energetic, fast-talking)
(voix féminine chaleureuse, ton posé) — descriptions in any supported language

Voice Cloning — provide a reference clip (3-10s). Clones timbre, accent, emotional tone.

"$VENV/bin/python" "$SCRIPT" \
  --text "This is the cloned voice speaking." \
  --ref /path/to/reference.wav \
  --out "$OUT"

Ultimate Cloning — reference + prompt for maximum fidelity (reproduces micro-level vocal nuances):

"$VENV/bin/python" "$SCRIPT" \
  --text "Highest fidelity clone." \
  --ref /path/to/ref.wav \
  --prompt-wav /path/to/ref.wav \
  --out "$OUT"

Long text via stdin (for articles, scripts):

cat /path/to/article.txt | "$VENV/bin/python" "$SCRIPT" --stdin --out "$OUT"

Step 3 — Verify and hand off

file "$OUT"   # Should show: "RIFF ... WAVE audio, Microsoft PCM, 16 bit, mono 48000 Hz"
ls -lh "$OUT" # Check size is reasonable

The script prints OK <duration>s <rtf>x <path> on success.

Output

Format: 48 kHz mono WAV, 16-bit PCM
Location: whatever --out path specified (typically /tmp/tts_*.wav)
Size: roughly 100 KB per second of audio
Usage: ready to attach to Telegram, embed in video, use as voiceover

Script options

Flag	Purpose
`--text STR`	Text to synthesize
`--stdin`	Read text from stdin (for long input)
`--out PATH`	Output WAV path (required)
`--ref PATH`	Reference audio for cloning
`--prompt-wav PATH`	Prompt wav for ultimate cloning
`--cfg FLOAT`	Classifier-free guidance (default 2.0)
`--steps INT`	Diffusion steps (default 10)
`--model ID`	Model id (default `openbmb/VoxCPM2`)
`--quiet`	Suppress loading messages

Supported languages (30)

Arabic, Burmese, Chinese (+ dialects), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese.

No language tag needed — VoxCPM auto-detects from the text.

Error Handling

ModuleNotFoundError: voxcpm — venv missing. Run the setup commands from Prerequisites.
No such file: VoxCPM2 weights — HuggingFace cache missing. First run will download (needs network, ~10 GB).
Slow first call (~5 min) — normal. Model download + initial load. Subsequent runs ~30s.
French pronunciation edge cases — add an IPA-ish hint or rephrase. Most names and proper nouns work out of the box.

Performance

On Apple M4 with MPS + bfloat16:

First load: ~340s (downloads weights)
Subsequent loads: ~30s
Generation: ~2.3× realtime (10s audio ≈ 23s compute)

Not suitable for real-time streaming. Good for batch generation, voiceovers, podcasts, voice messages.

Examples

Example 1: Voice message for Telegram

"$VENV/bin/python" "$SCRIPT" \
  --text "Hey, quick voice note about our meeting tomorrow." \
  --out /tmp/voice_msg.wav

Example 2: Clone a voice from an MP3

"$VENV/bin/python" "$SCRIPT" \
  --text "Bonjour, c'est une voix clonée localement." \
  --ref ~/my_voice_sample.mp3 \
  --out /tmp/cloned.wav

Example 3: Designed voice for narration

"$VENV/bin/python" "$SCRIPT" \
  --text "(deep narrator voice, dramatic, slow pace)In a world where AI runs locally..." \
  --out /tmp/narration.wav

Resources

VoxCPM2 source: https://github.com/OpenBMB/VoxCPM
Model card: https://huggingface.co/openbmb/VoxCPM2
Script source (same as bundled): https://github.com/vdk888/local-tts/blob/main/scripts/generate.py
License: Apache-2.0

信息

Category 效率工具

Name local-tts

版本 v20260423

大小 6.64KB

Source jeremylongshore/claude-code-plugins-plus-skills

更新时间 2026-04-28