- Google's new Gemini 3.1 Flash TTS ships with 200+ inline audio tags, 30 voices, and 70+ languages.
- Here's what the tags do, how to prompt them, and where the model actually fits.
TL;DR
On April 15, 2026, Google launched Gemini 3.1 Flash TTS — a preview text-to-speech model that turns plain transcripts into directed performances. Instead of picking a voice and praying the prosody lands, you embed [whispers], [slow], [awe], [short pause] tags right inside the text and the model follows. 200+ audio tags, 30 prebuilt voices, 70+ languages, Elo 1,211 on the Artificial Analysis TTS leaderboard, and SynthID watermarking baked in. Available now in the Gemini API, AI Studio, Vertex AI, and Google Vids.

What's new
The headline feature is inline audio tags — square-bracketed natural-language directives that sit directly in the transcript and change how specific words or phrases are spoken. No SSML. No post-processing. No separate API call for each emotion shift.
The Google Cloud team describes the core formula as:
[pacing tag] + spoken text + [expressive tag] + spoken text + [pause tag] + spoken text
A real prompt looks like this:
[encouraging] Let's try that last sentence again to make sure that you nailed it. [slow] "L'oiseau s'est envolé." [short pause] Perfect! [laughs] You're a natural.
Three tag families do most of the work:
- Expressive: [determination], [enthusiasm], [awe], [nervousness], [curiosity], [excitement], [confusion], [cheerful], [urgent], [calm], [serious].
- Pacing: [slow], [fast], [short pause], [long pause].
- Vocalization: [whispers], [laughs], [cackles], [gasp].
Style tags — [newscast], [documentary], [conversational], [formal] — shift an entire register rather than a single phrase. And helpfully, tags stay in English even when the spoken transcript is French, Japanese, or Arabic — one control layer across 70+ languages.
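In practice, the tags travel inside the ordinary text payload. Here's a minimal sketch using the google-genai Python SDK, assuming the preview model accepts the same speech-config shape as earlier Gemini TTS previews (the model ID comes from the spec table below; the output filename is arbitrary):

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",  # model ID from the table below
    contents=(
        "[encouraging] Let's try that last sentence again. "
        "[slow] \"L'oiseau s'est envolé.\" [short pause] "
        "Perfect! [laughs] You're a natural."
    ),
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

# Audio arrives as inline data on the first candidate part —
# raw PCM in current previews, so wrap it in a WAV header before playback.
pcm = response.candidates[0].content.parts[0].inline_data.data
with open("lesson.pcm", "wb") as f:
    f.write(pcm)
```

Note the French sentence with English tags: the control layer doesn't change with the transcript language.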
Why it matters
Traditional TTS pipelines handle emotion outside the script: pick a voice preset, maybe wrap chunks in SSML, hope it sounds right. If a line needed to pivot from calm to urgent mid-sentence, you either re-synthesized with a different preset or edited the waveform.
Flash TTS collapses that workflow. The transcript is the direction. A bank fraud alert can pivot from [neutral] to [serious] to [positive] inside one utterance. An audiobook narrator can add a [short pause] before the twist and a [whispers] after. Content teams get a friendlier mental model: you're not engineering output parameters, you're directing a performance.
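As a hypothetical illustration (the exact wording and tag choices here are mine, not Google's), that fraud alert might be scripted as a single tagged utterance:

```
[neutral] We noticed a new sign-in to your account from an unrecognized device.
[serious] If this wasn't you, call the number on the back of your card now.
[positive] If it was you, you're all set. No action needed.
```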
Technical facts
| Property | Value |
|---|---|
| Model ID | gemini-3.1-flash-tts-preview |
| Audio tags | 200+ |
| Prebuilt voices | 30 (Achernar, Aoede, Kore, Puck, Umbriel, Zephyr, Gacrux, and 23 others) |
| Languages | 70+ with regional variants |
| Input token limit | 8,192 |
| Output token limit | 16,384 |
| Context window | 32k tokens per session |
| Audio token rate | 25 tokens = 1 second of audio |
| Leaderboard Elo | 1,211 (Artificial Analysis TTS, blind preference) |
| Multi-speaker | Native, up to 2 speakers via MultiSpeakerVoiceConfig |
| Watermark | SynthID embedded in every output |
| GCP preview pricing | $1.00 / 1M input text tokens, $20.00 / 1M output audio tokens |
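The multi-speaker path is worth a sketch of its own. Assuming the config mirrors the existing Gemini multi-speaker TTS previews (the speaker labels "Host" and "Guest" are illustrative), two voices in one script looks like this:

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents=(
        "Host: [cheerful] Welcome back to the show!\n"
        "Guest: [calm] Thanks. Happy to be here."
    ),
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            # Map each speaker label in the transcript to a prebuilt voice.
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Host",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Guest",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                        ),
                    ),
                ]
            )
        ),
    ),
)
```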
Comparison vs ElevenLabs, Mistral, OpenAI
| Capability | Gemini 3.1 Flash TTS | ElevenLabs | Mistral open-weight | OpenAI TTS |
|---|---|---|---|---|
| Inline per-sentence control | Yes — 200+ tags | Account-level only | No | No |
| Voice cloning | No | Yes | Yes | Limited |
| Languages | 70+ | Many | Fewer | ~10 |
| Local deployment | No | No | Yes | No |
| Real-time <100ms | No (use Flash Live) | Partial | No | Partial |
| Watermark | SynthID | — | — | — |
If you need voice cloning, ElevenLabs still wins. If you need on-prem, Mistral's open-weight model is the pick. If you need dynamic expressive control inside a generated script — mid-sentence emotion shifts without juggling API calls — Flash TTS is the clearer fit.
Use cases that actually benefit
- Accessibility & AAC: screen readers and assistive tech get pacing and prosody that reduce cognitive load over long sessions.
- IVR & notifications: bank fraud alerts, flight delays, delivery updates — tone pivots inside a single message.
- Audiobooks & e-learning: chapter-level pacing, suspense beats, multi-speaker dialogue from one script.
- Voice agents (scripted layer): pair Flash TTS with Gemini 3.1 Flash Live — Live handles real-time conversation, TTS handles pre-generated narration and confirmations.
- Multilingual marketing: one English-tagged script, 70+ locales, consistent tone across markets.
Limitations & pricing
Preview means rough edges. Know these before shipping to production:
- No voice cloning — curated voices only.
- Cloud-only — no local inference; data residency teams should check Vertex AI regions.
- Not real-time — batch content, not sub-100ms streaming. Use Flash Live for live agents.
- Long-form drift — quality wobbles past a few minutes. Chunk by chapter or section.
- Prompt sensitivity — vague prompts can trigger PROHIBITED_CONTENT false rejections, or the model may read your director's notes out loud. Use a clear preamble and label the transcript section.
- Tag parsing fragility — two adjacent tags can produce unexpected results; validate LLM-generated annotations.
- Transient 500s — the API occasionally returns text tokens instead of audio. Implement retry logic (see the sketch after this list).
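A defensive wrapper for that last point might look like this. It's a sketch, not official guidance: it assumes the google-genai SDK and treats any response that carries no inline audio bytes as retryable.

```python
import time

from google import genai
from google.genai import types

client = genai.Client()

def synthesize_with_retry(transcript: str, voice: str = "Kore",
                          attempts: int = 3, backoff_s: float = 2.0) -> bytes:
    """Call the TTS model, retrying when the response carries no audio."""
    for attempt in range(1, attempts + 1):
        try:
            response = client.models.generate_content(
                model="gemini-3.1-flash-tts-preview",
                contents=transcript,
                config=types.GenerateContentConfig(
                    response_modalities=["AUDIO"],
                    speech_config=types.SpeechConfig(
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name=voice
                            )
                        )
                    ),
                ),
            )
            part = response.candidates[0].content.parts[0]
            # The reported failure mode: text tokens where audio should be.
            if part.inline_data and part.inline_data.data:
                return part.inline_data.data
        except Exception:
            pass  # transient 500s land here; fall through to the backoff
        time.sleep(backoff_s * attempt)
    raise RuntimeError(f"No audio after {attempts} attempts")
```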
Pricing on Google Cloud Text-to-Speech preview: $1.00 per 1M input text tokens and $20.00 per 1M output audio tokens, with audio metered at 25 tokens per second. The Gemini Developer API offers a free preview tier plus lower batch rates; verify before launch, since preview pricing shifts.
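At those rates the arithmetic is simple: 25 tokens per second of audio at $20.00 per million tokens works out to $0.0005 per second, $0.03 per minute, $1.80 per hour, plus a usually negligible input charge. A quick back-of-the-envelope helper:

```python
AUDIO_TOKENS_PER_SECOND = 25
USD_PER_1M_OUTPUT_TOKENS = 20.00
USD_PER_1M_INPUT_TOKENS = 1.00

def audio_cost_usd(seconds: float, input_tokens: int = 0) -> float:
    """Estimate preview-pricing cost for one synthesis call."""
    output = seconds * AUDIO_TOKENS_PER_SECOND * USD_PER_1M_OUTPUT_TOKENS / 1e6
    inp = input_tokens * USD_PER_1M_INPUT_TOKENS / 1e6
    return output + inp

print(f"${audio_cost_usd(3600):.2f} per hour of audio")  # $1.80
```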
What's next
Flash TTS is the third drop in the 3.1 audio family this spring. Flash-Lite shipped March 3 for low-latency, high-volume generation. Flash Live shipped March 26 with bidirectional streaming and interruption handling for real-time voice agents. Flash TTS (April 15) rounds out the scripted side.
What to watch: GA graduation, more Vertex AI regions, potential voice customization features, and tighter stitching between Flash TTS (batch narration) and Flash Live (real-time turns) so one product can cover both sides of a voice app.
If you build voice products, the interesting question isn't "is this better than ElevenLabs?" It's "what can I ship now that my pipeline supports mid-sentence direction?" Answer that and the model's value becomes obvious.
Sources: blog.google, Google Cloud Blog, Gemini API docs, Google DeepMind.

