TL;DR

Voicebox is a free, open-source desktop app by Jamie Pine that turns your laptop into a full voice-AI studio. Clone any voice from 3 seconds of audio, generate speech in 23 languages across 7 TTS engines, and drive everything through a local REST API at http://127.0.0.1:17493. The repo just crossed 23k stars with v0.4.5 shipping two days ago, and it's positioned as a direct, MIT-licensed replacement for ElevenLabs — minus the subscriptions, rate limits, and cloud dependency.

[Image: Voicebox desktop app, voice profiles and generation panel]

What's new

Voicebox bundles seven state-of-the-art open-source TTS models behind one Tauri (Rust) desktop client. You can switch engines per generation, mix multiple voices on a timeline, and ship voice lines through a localhost API. Once the models have been downloaded, no text or audio is sent to a third-party server.

  • Zero-shot voice cloning from a 3-second sample (upload, mic, or system audio capture from a YouTube clip).
  • 50+ preset voices via Kokoro and 9 premium speakers via Qwen CustomVoice for those who don't want to clone.
  • Stories editor — multi-track timeline for podcasts, NPC dialogue, and multi-character narration.
  • Built-in REST API on port 17493 with OpenAPI docs — the real unlock for AI agents and games.
  • Whisper transcription baked in, so reference text is auto-extracted from voice samples.

Why it matters

Commercial voice tools like ElevenLabs gate the good stuff behind subscriptions and per-character meters. That's fine for a one-off ad read, brutal for an audiobook, an indie game with 5,000 NPC lines, or an AI agent that talks all day. Voicebox flips the economics: download once, generate forever, on your own GPU.

The privacy angle is just as load-bearing. Unreleased game scripts, internal training narration, and confidential client voiceovers never leave the machine. Quality is no longer the obvious trade-off either: in benchmarks published by the Qwen team, their 1.7B model posts a lower Word Error Rate than ElevenLabs on Chinese, English, Italian, and Spanish, so the “open-source = lower quality” assumption no longer holds by default.

Technical facts

The seven engines are not interchangeable — each has a niche. Pick the one that fits the workload:

| Engine | Size | Strength |
|---|---|---|
| Qwen3-TTS | 0.6B / 1.7B | Highest-quality multilingual cloning, 10 languages |
| Qwen CustomVoice | 0.6B / 1.7B | 9 preset voices, natural-language delivery (“warm, slow, cinematic”) |
| Chatterbox Multilingual | (not listed) | Broadest coverage: 23 languages |
| Chatterbox Turbo | 350M | Only engine that interprets [laugh], [sigh], [gasp] tags |
| HumeAI TADA | 1B / 3B | Long-form coherent audio, 700s+ without drift |
| LuxTTS | (not listed) | 150x realtime on CPU, ~1GB VRAM, 48kHz output |
| Kokoro | 82M | Tiny, CPU-realtime, 50 preset voices |

Other hard numbers worth knowing: max 50,000 characters per generation (auto-chunked at sentence boundaries with a 0–200ms crossfade), 8 audio effects via Spotify's Pedalboard library, and a model footprint of 2–4 GB for Qwen3-TTS that auto-downloads from HuggingFace on first use.
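To make the chunking and crossfade idea concrete, here is a minimal Python sketch. It is not Voicebox's actual implementation: the function names, the 500-character chunk size, the regex-based sentence splitter, and the linear blend are all assumptions chosen for illustration. It only shows what "split at sentence boundaries, then overlap the joins by up to 200 ms" can look like.

```python
import re


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Group whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept intact as its own chunk.
    max_chars=500 is an illustrative value, not a Voicebox setting.
    """
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def ms_to_samples(milliseconds: float, sample_rate: int) -> int:
    """Convert a crossfade length in milliseconds to a sample count."""
    return int(sample_rate * milliseconds / 1000)


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two sample sequences, linearly blending `overlap` samples.

    The tail of `a` fades out while the head of `b` fades in.
    A linear ramp is an assumption; other fade curves are possible.
    """
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        weight_b = (i + 1) / (overlap + 1)  # rises toward 1 across the overlap
        blended.append(a[len(a) - overlap + i] * (1 - weight_b) + b[i] * weight_b)
    return a[: len(a) - overlap] + blended + b[overlap:]
```

At the 48kHz output rate quoted for LuxTTS, a 200 ms crossfade corresponds to `ms_to_samples(200, 48000)`, or 9,600 overlapping samples per join.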

Comparison vs ElevenLabs

| Dimension | Voicebox | ElevenLabs |
|---|---|---|
| Cost | $0 forever, MIT license | Subscription + per-character meter |
| Privacy | 100% local, no network after model download | Cloud upload required |
| Quality (WER) | Qwen3-TTS 1.7B beats EL on zh/en/it/es | Slight edge on de/pt |
| Real-time streaming | Not yet (on roadmap) | Yes |
| Hardware | Your GPU (or CPU) | Their cloud cluster |
| Expressive control | Natural-language instruct + paralinguistic tags | Limited preset emotions |

Use cases

[Image: Voicebox Stories editor, multi-track timeline]

  • Game devs: generate NPC dialogue on the fly, localize characters into 23 languages, ship expressive lines without booking a studio.
  • AI agents: POST to /generate on localhost and your agent has a voice — zero per-character cost, zero rate limits, runs on the user's machine.
  • Podcasters: use the Stories timeline to mix multi-character conversations from a single keyboard.
  • Audiobook authors: batch-generate up to 50,000 chars per run with auto-chunking and crossfade.
  • Accessibility devs: ship offline screen-readers that use a familiar voice, no network required.

For developers, the killer detail is that Voicebox is “just a localhost URL.” A minimal generation call:

curl -X POST http://127.0.0.1:17493/generate \
  -H "Content-Type: application/json" \
  -d '{"text":"Welcome to the game, player one.","profile_id":"...","engine":"qwen_custom_voice","instruct":"warm, slow, cinematic"}' \
  --output line.wav
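For agents written in Python, the same call can be sketched with the standard library alone. The endpoint, port, and JSON fields are taken from the curl command above; everything else is an assumption: the helper names are made up for this example, the 300-second timeout is arbitrary, and the response body is assumed to be raw audio bytes because the curl call writes it straight to line.wav. Check the bundled OpenAPI docs for the actual schema before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:17493"  # local API address quoted in the article


def build_generate_request(text, profile_id, engine, instruct=None):
    """Build a POST request for /generate with the fields used in the curl example."""
    payload = {"text": text, "profile_id": profile_id, "engine": engine}
    if instruct is not None:
        payload["instruct"] = instruct
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_to_file(request, out_path, timeout=300):
    """Send the request and write the response body to out_path.

    Assumes the server answers with audio bytes (as `--output line.wav` implies).
    Raises urllib.error.URLError if Voicebox is not running locally.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

To use it, pass the ID of one of your own voice profiles as `profile_id`, build the request with `build_generate_request(...)`, and hand the result to `generate_to_file(request, "line.wav")` while the desktop app is running.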

Limitations & pricing

Pricing is the easy part: free, MIT license, forever. The trade-offs sit elsewhere.

  • No real-time streaming yet. Audio is generated chunk-by-chunk; word-by-word streaming is on the roadmap.
  • Linux: no prebuilt binaries (blocked by GitHub Actions disk-space limits) — build from source for now. macOS (Apple Silicon + Intel), Windows MSI, and Docker Compose are first-class.
  • CPU-only inference on Windows/Intel is noticeably slow. NVIDIA CUDA or Apple MLX recommended.
  • Paralinguistic tags ([laugh], [sigh]) only work in Chatterbox Turbo — other engines read them literally.
  • No mobile companion app and no live conversation mode — both planned.
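The paralinguistic-tag limitation is easy to work around on the client side. The sketch below strips [laugh], [sigh], and [gasp] before sending text to an engine that would read them aloud. It is an illustration only: `prepare_text` is a made-up helper, and the engine identifiers `chatterbox_turbo` and `kokoro` are hypothetical placeholders, since the article only confirms the `qwen_custom_voice` ID.

```python
import re

# Tags the article says Chatterbox Turbo interprets.
TAG_PATTERN = re.compile(r"\s*\[(?:laugh|sigh|gasp)\]")

# Hypothetical engine identifier; check the API docs for the real value.
TAG_AWARE_ENGINES = {"chatterbox_turbo"}


def prepare_text(text: str, engine: str) -> str:
    """Return text unchanged for tag-aware engines, otherwise remove the tags."""
    if engine in TAG_AWARE_ENGINES:
        return text
    cleaned = TAG_PATTERN.sub("", text)
    # Collapse any doubled spaces left behind and trim the ends.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

With this in place, one script can be routed to any engine: the tagged version goes to Chatterbox Turbo, and every other engine receives clean prose.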

[Image: Voicebox clone-voice dialog, record a 30s sample]

What's next

The roadmap is unusually crisp: real-time streaming, voice design from text-only descriptions, live conversation mode, mobile remote control, a plugin architecture for custom models, and XTTS + Bark engine support. Linux prebuilt binaries land as soon as the CI disk-space issue is unblocked. With 24 releases shipped in roughly three months, the velocity is real.

If you've been waiting for a moment to stop paying per-character for synthetic voice, this is it. Grab the installer from voicebox.sh, point your AI agent at http://127.0.0.1:17493, and own your voice stack.

Sources: github.com/jamiepine/voicebox, voicebox.sh, QwenLM/Qwen3-TTS, scriptbyai.