- Google's new Gemini 3.1 Flash TTS ships with 200+ inline audio tags, 30 voices, and 70+ languages.
- Here's what the tags do, how to prompt them, and where the model actually fits.
TL;DR
On April 15, 2026, Google launched Gemini 3.1 Flash TTS — a preview text-to-speech model that turns plain transcripts into directed performances. Instead of picking a voice and praying the prosody lands, you embed [whispers], [slow], [awe], [short pause] tags right inside the text and the model follows. 200+ audio tags, 30 prebuilt voices, 70+ languages, Elo 1,211 on the Artificial Analysis TTS leaderboard, and SynthID watermarking baked in. Available now in the Gemini API, AI Studio, Vertex AI, and Google Vids.

What's new
The headline feature is inline audio tags — square-bracketed natural-language directives that sit directly in the transcript and change how specific words or phrases are spoken. No SSML. No post-processing. No separate API call for each emotion shift.
The Google Cloud team describes the core formula as:
[pacing tag] + spoken text + [expressive tag] + spoken text + [pause tag] + spoken text
A real prompt looks like this:
[encouraging] Let's try that last sentence again to make sure that you nailed it. [slow] "L'oiseau s'est envolé." [short pause] Perfect! [laughs] You're a natural.
Three tag families do most of the work:
- Expressive: [determination], [enthusiasm], [awe], [nervousness], [curiosity], [excitement], [confusion], [cheerful], [urgent], [calm], [serious].
- Pacing: [slow], [fast], [short pause], [long pause].
- Vocalization: [whispers], [laughs], [cackles], [gasp].
Style tags — [newscast], [documentary], [conversational], [formal] — shift an entire register rather than a single phrase. And helpfully, tags stay in English even when the spoken transcript is French, Japanese, or Arabic — one control layer across 70+ languages.
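In practice, the tags travel inside the ordinary text payload. Here's a minimal sketch using the google-genai Python SDK, assuming the preview model accepts the same speech-config shape as earlier Gemini TTS previews (the model ID comes from the spec table below; the output filename is arbitrary):

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",  # model ID from the table below
    contents=(
        "[encouraging] Let's try that last sentence again. "
        "[slow] \"L'oiseau s'est envolé.\" [short pause] "
        "Perfect! [laughs] You're a natural."
    ),
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

# Audio arrives as inline data on the first candidate part —
# raw PCM in current previews, so wrap it in a WAV header before playback.
pcm = response.candidates[0].content.parts[0].inline_data.data
with open("lesson.pcm", "wb") as f:
    f.write(pcm)
```

Note the French sentence with English tags: the control layer doesn't change with the transcript language.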
Why it matters
Traditional TTS pipelines handle emotion outside the script: pick a voice preset, maybe wrap chunks in SSML, hope it sounds right. If a line needed to pivot from calm to urgent mid-sentence, you either re-synthesized with a different preset or edited the waveform.
Flash TTS collapses that workflow. The transcript is the direction. A bank fraud alert can pivot from [neutral] to [serious] to [positive] inside one utterance. An audiobook narrator can add a [short pause] before the twist and a [whispers] after. Content teams get a friendlier mental model: you're not engineering output parameters, you're directing a performance.
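As a hypothetical illustration (the exact wording and tag choices here are mine, not Google's), that fraud alert might be scripted as a single tagged utterance:

```
[neutral] We noticed a new sign-in to your account from an unrecognized device.
[serious] If this wasn't you, call the number on the back of your card now.
[positive] If it was you, you're all set. No action needed.
```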
Technical facts
| Property | Value |
|---|---|
| Model ID | gemini-3.1-flash-tts-preview |
| Audio tags | 200+ |
| Prebuilt voices | 30 (Achernar, Aoede, Kore, Puck, Umbriel, Zephyr, Gacrux, and 23 others) |
| Languages | 70+ with regional variants |
| Input token limit | 8,192 |
| Output token limit | 16,384 |
| Context window | 32k tokens per session |
| Audio token rate | 25 tokens = 1 second of audio |
| Leaderboard Elo | 1,211 (Artificial Analysis TTS, blind preference) |
| Multi-speaker | Native, up to 2 speakers via MultiSpeakerVoiceConfig |
| Watermark | SynthID embedded in every output |
| GCP preview pricing | $1.00 / 1M input text tokens, $20.00 / 1M output audio tokens |
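The multi-speaker path is worth a sketch of its own. Assuming the config mirrors the existing Gemini multi-speaker TTS previews (the speaker labels "Host" and "Guest" are illustrative), two voices in one script looks like this:

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents=(
        "Host: [cheerful] Welcome back to the show!\n"
        "Guest: [calm] Thanks. Happy to be here."
    ),
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            # Map each speaker label in the transcript to a prebuilt voice.
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Host",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Guest",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                        ),
                    ),
                ]
            )
        ),
    ),
)
```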
Comparison vs ElevenLabs, Mistral, OpenAI
| Capability | Gemini 3.1 Flash TTS | ElevenLabs | Mistral open-weight | OpenAI TTS |
|---|---|---|---|---|
| Inline per-sentence control | Yes — 200+ tags | Account-level only | No | No |
| Voice cloning | No | Yes | Yes | Limited |
| Languages | 70+ | Many | Fewer | ~10 |
| Local deployment | No | No | Yes | No |
| Real-time <100ms | No (use Flash Live) | Partial | No | Partial |
| Watermark | SynthID | — | — | — |
If you need voice cloning, ElevenLabs still wins. If you need on-prem, Mistral's open-weight model is the pick. If you need dynamic expressive control inside a generated script — mid-sentence emotion shifts without juggling API calls — Flash TTS is the clearer fit.
Use cases that actually benefit
- Accessibility & AAC: screen readers and assistive tech get pacing and prosody that reduce cognitive load over long sessions.
- IVR & notifications: bank fraud alerts, flight delays, delivery updates — tone pivots inside a single message.
- Audiobooks & e-learning: chapter-level pacing, suspense beats, multi-speaker dialogue from one script.
- Voice agents (scripted layer): pair Flash TTS with Gemini 3.1 Flash Live — Live handles real-time conversation, TTS handles pre-generated narration and confirmations.
- Multilingual marketing: one English-tagged script, 70+ locales, consistent tone across markets.
Limitations & pricing
Preview means rough edges. Know these before shipping to production:
- No voice cloning — curated voices only.
- Cloud-only — no local inference; data residency teams should check Vertex AI regions.
- Not real-time — batch content, not sub-100ms streaming. Use Flash Live for live agents.
- Long-form drift — quality wobbles past a few minutes. Chunk by chapter or section.
- Prompt sensitivity — vague prompts can trigger PROHIBITED_CONTENT false rejections, or the model may read your director's notes out loud. Use a clear preamble and label the transcript section.
- Tag parsing fragility — two adjacent tags can produce unexpected results; validate LLM-generated annotations.
- Transient 500s — the API occasionally returns text tokens instead of audio. Implement retry logic (see the sketch after this list).
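A defensive wrapper for that last point might look like this. It's a sketch, not official guidance: it assumes the google-genai SDK and treats any response that carries no inline audio bytes as retryable.

```python
import time

from google import genai
from google.genai import types

client = genai.Client()

def synthesize_with_retry(transcript: str, voice: str = "Kore",
                          attempts: int = 3, backoff_s: float = 2.0) -> bytes:
    """Call the TTS model, retrying when the response carries no audio."""
    for attempt in range(1, attempts + 1):
        try:
            response = client.models.generate_content(
                model="gemini-3.1-flash-tts-preview",
                contents=transcript,
                config=types.GenerateContentConfig(
                    response_modalities=["AUDIO"],
                    speech_config=types.SpeechConfig(
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name=voice
                            )
                        )
                    ),
                ),
            )
            part = response.candidates[0].content.parts[0]
            # The reported failure mode: text tokens where audio should be.
            if part.inline_data and part.inline_data.data:
                return part.inline_data.data
        except Exception:
            pass  # transient 500s land here; fall through to the backoff
        time.sleep(backoff_s * attempt)
    raise RuntimeError(f"No audio after {attempts} attempts")
```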
Pricing on Google Cloud Text-to-Speech preview: $1.00 per 1M input text tokens and $20.00 per 1M output audio tokens, with audio metered at 25 tokens per second. The Gemini Developer API offers a free preview tier plus lower batch rates; verify before launch, since preview pricing shifts.
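At those rates the arithmetic is simple: 25 tokens per second of audio at $20.00 per million tokens works out to $0.0005 per second, $0.03 per minute, $1.80 per hour, plus a usually negligible input charge. A quick back-of-the-envelope helper:

```python
AUDIO_TOKENS_PER_SECOND = 25
USD_PER_1M_OUTPUT_TOKENS = 20.00
USD_PER_1M_INPUT_TOKENS = 1.00

def audio_cost_usd(seconds: float, input_tokens: int = 0) -> float:
    """Estimate preview-pricing cost for one synthesis call."""
    output = seconds * AUDIO_TOKENS_PER_SECOND * USD_PER_1M_OUTPUT_TOKENS / 1e6
    inp = input_tokens * USD_PER_1M_INPUT_TOKENS / 1e6
    return output + inp

print(f"${audio_cost_usd(3600):.2f} per hour of audio")  # $1.80
```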
What's next
Flash TTS is the third drop in the 3.1 audio family this spring. Flash-Lite shipped March 3 for low-latency, high-volume generation. Flash Live shipped March 26 with bidirectional streaming and interruption handling for real-time voice agents. Flash TTS (April 15) rounds out the scripted side.
What to watch: GA graduation, more Vertex AI regions, potential voice customization features, and tighter stitching between Flash TTS (batch narration) and Flash Live (real-time turns) so one product can cover both sides of a voice app.
If you build voice products, the interesting question isn't "is this better than ElevenLabs?" It's "what can I ship now that my pipeline supports mid-sentence direction?" Answer that and the model's value becomes obvious.
Sources: blog.google, Google Cloud Blog, Gemini API docs, Google DeepMind.

