xAI STT goes live on LiveKit Inference: a cascaded voice pipeline with a single API key

TL;DR

LiveKit has announced that xAI STT is live on LiveKit Inference. Developers can now run a complete cascaded voice agent pipeline (STT → Grok LLM → TTS) with a single LiveKit API key. It comes with unified billing, a shared concurrency dashboard, component hot-swapping by changing a string, and low latency over LiveKit's private backbone.

More notable: Grok STT claims a 5.0% error rate on phone call entity recognition, well ahead of ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%). At $0.10/hour for batch and $0.20/hour for streaming, it is also priced to compete hard in the enterprise STT market.

What's new

From LiveKit on X: "xAI STT is live. You can now run a complete cascaded voice agent pipeline on xAI (STT + Grok + TTS) through LiveKit Inference with one API key."

Background: on April 18, 2026, xAI made two standalone APIs generally available, Grok STT and Grok TTS, on the same production stack that runs Grok Voice in the mobile app, Tesla, and Starlink support. LiveKit quickly added them to Inference, making itself a one-stop shop for xAI's entire voice stack: Grok LLM + Grok TTS + Grok STT + Grok Voice Agent API (speech-to-speech).

Why it matters

The cascaded pipeline (VAD → STT → LLM → TTS) is still the most common production architecture for voice agents: it is flexible, debuggable, and lets you mix and match providers. But wiring it together is a nightmare, because each vendor has its own API key, its own billing, and its own way of handling rate limits (STT counts concurrent WebSocket connections, LLMs count tokens per minute, TTS counts concurrent generations).

LiveKit Inference pulls all of this together:

One key for STT + LLM + TTS
A unified API: changing the TTS voice only takes a string edit, e.g. tts="xai/grok-tts:eve"
Concurrency limits per model type rather than per provider: switching OpenAI ↔ Gemini does not require renegotiating quota
Global co-location: agents run in the same data center as the inference service and go over the private backbone
Provisioned capacity with each provider to avoid congested public endpoints
Dynamic routing (coming soon): automatic rerouting when a slow region or provider is detected
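
To make the "swap a component by editing a string" idea concrete, here is a minimal sketch that parses and rewrites a model string such as xai/grok-tts:eve. The provider/model:variant layout is inferred from that single example in the article, and the names ModelDescriptor and parse_descriptor are hypothetical helpers, not part of any LiveKit SDK. The lowercase voice name "leo" is likewise an assumption based on the voice list published for Grok TTS.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ModelDescriptor:
    """Hypothetical container for a 'provider/model[:variant]' string."""
    provider: str            # e.g. "xai"
    model: str               # e.g. "grok-tts"
    variant: Optional[str]   # e.g. a voice name such as "eve"; None if absent

    def __str__(self) -> str:
        base = f"{self.provider}/{self.model}"
        return f"{base}:{self.variant}" if self.variant else base


def parse_descriptor(text: str) -> ModelDescriptor:
    """Split 'provider/model[:variant]' into parts (format is an assumption)."""
    provider, sep, rest = text.partition("/")
    model, _, variant = rest.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model[:variant]', got {text!r}")
    return ModelDescriptor(provider, model, variant or None)


tts = parse_descriptor("xai/grok-tts:eve")
# Swapping the voice changes only the variant; provider and model are untouched.
tts_leo = replace(tts, variant="leo")
print(tts_leo)
```

The point of the sketch is that the whole component choice lives in one short string, so switching a voice or a provider is a configuration change rather than a code change.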

The upshot: developers focus on the voice product instead of sinking time into "undifferentiated infrastructure work" (LiveKit's phrase).

Technical facts

Component	Property	Value
Grok STT	Languages	25
Grok STT	Modes	Batch + streaming
Grok STT	Pricing	$0.10/hr batch · $0.20/hr streaming
Grok STT	Max file size	500 MB / request
Grok STT	Audio formats	12 (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV, PCM, µ-law, A-law)
Grok STT	Features	Diarization, word-level timestamps, multichannel, Inverse Text Normalization
Grok TTS	Languages	20
Grok TTS	Voices	5 (Ara, Eve, Leo, Rex, Sal — Eve default)
Grok TTS	Pricing	$4.20 / 1M characters
Grok TTS	REST limit	15,000 chars/req · no limit over WebSocket
LiveKit Inference	SDKs	Python + Node.js
LiveKit Inference	Dashboard	Unified concurrency limits per model type
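
The STT limits in the table lend themselves to a cheap client-side check before uploading a file. The sketch below is illustrative only: it assumes "500 MB" means 500,000,000 bytes (the article does not say whether the limit is decimal or binary), and the extension set is a guess at how the listed container formats usually appear on disk. PCM, µ-law, and A-law are raw encodings without a standard extension, so they are left out. The function name preflight is hypothetical.

```python
from pathlib import Path

# Assumption: the conservative decimal reading of "500 MB".
MAX_BYTES = 500_000_000

# Assumption: common file extensions for the container formats in the table.
ASSUMED_EXTENSIONS = {
    ".wav", ".mp3", ".ogg", ".opus", ".flac", ".aac", ".mp4", ".m4a", ".mkv",
}


def preflight(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the upload looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ASSUMED_EXTENSIONS:
        problems.append(f"unrecognized extension {ext!r}")
    if size_bytes > MAX_BYTES:
        problems.append(f"{size_bytes} bytes exceeds the {MAX_BYTES}-byte limit")
    return problems


print(preflight("support_call.WAV", 42_000_000))
print(preflight("recording.xyz", 600_000_000))
```

A check like this only saves a round trip; the service remains the authority on what it accepts.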

Grok TTS supports inline tags such as [laugh], [sigh], and [breath], plus wrapping tags <whisper>…</whisper> and <emphasis>…</emphasis>, which help the output sound natural rather than flat like traditional TTS.
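
As a small illustration of how such tags might be assembled into TTS input, the sketch below builds a tagged string from the tag names mentioned above. The helper names inline and wrap are hypothetical, and the tag sets contain only the examples the article lists, which may not be the full set the service accepts.

```python
# Only the tags named in the article; the real list may be longer.
INLINE_TAGS = {"laugh", "sigh", "breath"}
WRAPPING_TAGS = {"whisper", "emphasis"}


def inline(tag: str) -> str:
    """Render an inline tag such as [sigh]."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag!r}")
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a paired tag such as <whisper>...</whisper>."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag!r}")
    return f"<{tag}>{text}</{tag}>"


line = " ".join([
    "Well,",
    inline("sigh"),
    "that took a while.",
    wrap("whisper", "But it worked."),
])
print(line)
# Well, [sigh] that took a while. <whisper>But it worked.</whisper>
```

Validating tag names up front catches typos before they are read aloud as literal text or rejected by the service.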

Comparison — STT benchmarks

On phone call entity recognition (recognizing names, account numbers, and dates over the phone, arguably the hardest task in STT), Grok is well ahead of all three major competitors:

Provider	Error rate
Grok STT	5.0%
ElevenLabs	12.0%
Deepgram	13.5%
AssemblyAI	21.3%
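
To put the gap in relative terms, the sketch below computes how many fewer errors Grok STT would make than each competitor, using only the xAI-published rates from the table. The function name relative_reduction is a hypothetical helper.

```python
# Error rates (%) on phone call entity recognition, as published by xAI.
error_rates = {
    "Grok STT": 5.0,
    "ElevenLabs": 12.0,
    "Deepgram": 13.5,
    "AssemblyAI": 21.3,
}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Percentage of the baseline's errors that the candidate avoids."""
    return (baseline - candidate) / baseline * 100


grok = error_rates["Grok STT"]
for name, rate in error_rates.items():
    if name != "Grok STT":
        reduction = relative_reduction(rate, grok)
        print(f"vs {name}: {reduction:.1f}% fewer errors")
```

On these numbers the reduction comes out at roughly 58% against ElevenLabs, 63% against Deepgram, and 77% against AssemblyAI.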

On video/podcast transcription, Grok and ElevenLabs tie for first at 2.4%. Overall word error rate: 6.9%. These figures are published by xAI and need to be verified on real workloads, but they are bold claims nonetheless, especially for medical, legal, and financial verticals where entity recognition determines quality.

If you are weighing the integrated speech-to-speech route instead of a cascaded one, the Grok Voice Agent API is priced at $0.05/minute, cheaper than Deepgram ($0.08), ElevenLabs ($0.09), OpenAI Realtime (~$0.10), and Bland ($0.14). End-to-end latency is <700ms, time-to-first-audio is <1s, and it ranks #1 on Big Bench Audio.

Use cases

Call center & customer support: the same stack already serves Tesla and Starlink support, and the 5.0% entity error rate is a strong selling point
Healthcare & therapy: expressive tags + HIPAA-eligible BAA for intake, coaching, and companion apps
Education & tutoring: multilingual code-switching suits language-learning apps
Sales & recruiting: outbound qualification, screening interviews
In-car & edge: Tesla is a design partner; tool calling + real-time X/web search
Meeting / podcast transcription: diarization + word timestamps + ITN for currency and dates

Limitations & pricing

STT: max 500 MB/request; only 12 audio formats are supported
TTS REST: 15,000 chars/request; longer text must use WebSocket streaming
The Grok Voice Agent API plugin on LiveKit is currently Python-only; Node.js is on the roadmap
xAI's docs are still thin on production topics (pure SIP/PSTN, security, monitoring); LiveKit fills most of that gap
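
Where WebSocket streaming is not an option, one workaround for the 15,000-character REST limit is to split long text into several requests. The sketch below packs whole sentences greedily into chunks under the limit. It is an illustration rather than xAI's recommended approach, and it assumes the limit counts plain characters the way Python's len() does. The function name chunk_text is hypothetical.

```python
import re

REST_CHAR_LIMIT = 15_000  # per-request limit quoted in the article


def chunk_text(text: str, limit: int = REST_CHAR_LIMIT) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `limit` characters."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo = "First sentence. Second one! A third? And a fourth."
for piece in chunk_text(demo, limit=30):
    print(len(piece), piece)
```

Splitting on sentence boundaries keeps prosody intact within each request. Text that uses wrapping tags would need extra care so that an opening and closing tag never land in different chunks.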

Pricing at a glance: STT $0.10/$0.20 per hour (batch/streaming) · TTS $4.20 per 1M chars · Voice Agent $0.05/min. LiveKit Inference includes free monthly credits on every Cloud plan.
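
A quick back-of-the-envelope calculation shows how these list prices add up for a single call. The sketch below uses only the rates quoted above; the 10-minute call and the 6,000 synthesized characters are made-up inputs. It deliberately leaves out LLM token costs and any LiveKit platform fees, which the article does not quantify, so the two totals are not a like-for-like comparison.

```python
STT_BATCH_PER_HOUR = 0.10
STT_STREAMING_PER_HOUR = 0.20
TTS_PER_MILLION_CHARS = 4.20
VOICE_AGENT_PER_MINUTE = 0.05


def cascaded_audio_cost(minutes: float, tts_chars: int, streaming: bool = True) -> float:
    """STT + TTS cost in USD for one call; LLM tokens are NOT included."""
    stt_rate = STT_STREAMING_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    stt = minutes / 60 * stt_rate
    tts = tts_chars / 1_000_000 * TTS_PER_MILLION_CHARS
    return stt + tts


def voice_agent_cost(minutes: float) -> float:
    """Speech-to-speech cost in USD for one call at the per-minute rate."""
    return minutes * VOICE_AGENT_PER_MINUTE


minutes, chars = 10, 6_000  # hypothetical call
print(f"cascaded, STT + TTS only: ${cascaded_audio_cost(minutes, chars):.4f}")
print(f"speech-to-speech:         ${voice_agent_cost(minutes):.4f}")
```

For this hypothetical call the audio side of the cascaded pipeline costs about $0.06 against $0.50 for speech-to-speech, so the LLM bill is what decides the real comparison.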

What's next

LiveKit Inference, dynamic routing: real-time latency monitoring per region and provider, with automatic rerouting during outages
LiveKit xAI plugin, Node.js: to ship after Python
xAI audio models v2: improved pronunciation and latency

With STT GA + TTS GA + Voice Agent API GA and LiveKit Inference integrating the full set, xAI is putting clear pressure on OpenAI Realtime API and Gemini Live in real-time conversational voice. LiveKit, meanwhile, continues to cement its role as the "picks and shovels" of the voice AI rush: it trains no models, but it is the pipeline everything has to pass through.

Sources: LiveKit Blog, xAI News, LiveKit Inference, MarkTechPost, LiveKit Docs.
