## TL;DR
OpenAI's newest speech-to-speech model — gpt-realtime-1.5, released to the Realtime API on Feb 23, 2026 — now sits 20+ points ahead of the nearest competitor on Sierra’s τ³-Bench voice leaderboard, per a public claim from OpenAI researcher Eric (@veggie_eric). The jump from the December “OG” gpt-realtime is driven by significantly better realism and tool calling — exactly the two axes where voice agents collapse under realistic audio conditions.
## What’s new
Sierra’s τ³-Bench (launched Mar 18, 2026) is the first serious full-duplex voice-agent benchmark. It runs customer-support simulations over real audio — with background noise, interruptions, and turn-taking — against the actual Realtime APIs from OpenAI, Google Gemini Live, and xAI Grok Voice. A 20+ point gap on that board is not a press-release flex; it’s the biggest single-release jump reported on the leaderboard to date.
Eric’s framing: “Pretty massive upgrade from our OG model from December, with significantly improved realism and tool calling capabilities.” Translation: the two hardest problems in production voice agents — sounding human and reliably calling APIs mid-conversation — both moved at the same time.
Worth noting why this is a big deal architecturally. Realism and tool calling usually trade off against each other in speech-to-speech models. More expressive audio generation tends to burn compute that could have gone to structured reasoning; tighter tool-calling traces tend to flatten prosody. Shipping both in the same release suggests OpenAI found architectural headroom — likely in the decoder and function-routing layers — rather than pulling one lever at the expense of the other.
## Why it matters
Voice agents have been stuck in an awkward middle ground. Sierra’s own numbers tell the story: text agents with reasoning hit ~85% task completion, plain text agents hit ~54%, but voice agents under realistic audio were stuck at 26–38%. The bottleneck wasn’t raw intelligence — it was authentication. Mishear a name or email once, and every downstream tool call fails.
A 20+ point leap directly attacks that failure mode. Better realism means fewer mishears. Better tool calling means fewer dropped CRM writes, fewer broken bookings, fewer failed payment confirmations. That’s the difference between a demo and a product.
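One common mitigation for the mishear-then-cascade failure mode described above is to validate captured identifiers before issuing any tool call, and re-prompt the caller when validation fails. A minimal sketch of that gate follows; the function name, field set, and checks are illustrative, not part of any OpenAI API:

```python
import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def gate_auth_fields(heard: dict) -> tuple[bool, list[str]]:
    """Return (ok, fields_to_reprompt) for ASR-captured auth fields.

    Downstream tool calls (CRM lookups, bookings, payments) only fire
    once every required field passes a cheap sanity check; otherwise
    the agent reads the value back and asks the caller to confirm.
    """
    problems = []
    if not EMAIL_RE.match(heard.get("email", "")):
        problems.append("email")
    if len(heard.get("name", "").split()) < 2:  # want first + last
        problems.append("name")
    return (not problems, problems)

ok, redo = gate_auth_fields({"name": "Ada Lovelace",
                             "email": "ada@example.com"})
# ok is True, redo is []
```

The point of the gate is cheapness: a regex and a word count cost nothing compared to a failed payment confirmation three tool calls later.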
## Technical facts
| Property | Value |
|---|---|
| Model ID | gpt-realtime-1.5 |
| Released | Feb 23, 2026 (Realtime API) |
| Prior baseline | gpt-realtime (Dec 2025) & gpt-realtime-mini-2025-12-15 |
| Sierra τ³-Bench voice gap | 20+ pts ahead of nearest competitor |
| Context window | 32,000 tokens |
| Max output | 4,096 tokens |
| Modalities | Text, Image (in), Audio (in/out) |
| Function calling | Supported |
| Audio input | $32 / 1M tokens |
| Audio output | $64 / 1M tokens |
| Cached audio input | $0.40 / 1M tokens |
For context on the trajectory: the Dec 15, 2025 gpt-realtime-mini snapshot alone delivered +18.6 percentage points in instruction-following and +12.9 pp in tool-calling accuracy over its predecessor. gpt-realtime-1.5 is the next rung on that same ladder — now on the full-size model.
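Using the per-token prices from the table above, a rough per-call cost is easy to work out. The tokens-per-minute figure below is an assumption for illustration (audio token rates vary by codec and model), not an OpenAI-published number:

```python
def call_cost_usd(in_minutes, out_minutes, cached_fraction=0.0,
                  tokens_per_minute=800):
    """Estimate the audio cost of one call at gpt-realtime-1.5 list prices.

    tokens_per_minute is an illustrative assumption; cached_fraction is
    the share of input audio billed at the cached rate ($0.40/1M)
    instead of the full rate ($32/1M). Output audio is $64/1M.
    """
    in_tok = in_minutes * tokens_per_minute
    out_tok = out_minutes * tokens_per_minute
    cached_tok = in_tok * cached_fraction
    fresh_tok = in_tok - cached_tok
    return (fresh_tok * 32 + cached_tok * 0.40 + out_tok * 64) / 1_000_000

# A 5-minute call, half listening / half speaking, no caching:
print(f"${call_cost_usd(2.5, 2.5):.4f}")  # → $0.1920
```

Under these assumptions, a high-volume support line doing 10,000 such calls a day lands near $2,000/day in audio tokens alone, which is why the cached-input rate matters so much for repeated system prompts and greetings.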
## Comparison: the voice-provider field
| Provider | Model | τ³-Bench voice position |
|---|---|---|
| OpenAI | gpt-realtime-1.5 | Leader (20+ pt gap) |
| Google | Gemini Live | Within the 26–38% realistic-voice band |
| xAI | Grok Voice | Within the 26–38% realistic-voice band |
Before gpt-realtime-1.5, Sierra described the three providers as “closely matched.” That framing is now obsolete.
## Use cases that just got unblocked
- Customer support voice agents: the auth-and-intent phase was the #1 bottleneck. More realism = fewer mishears; better tool-calling = fewer botched ticket lookups.
- Outbound voice (sales, scheduling, surveys): tool reliability mid-call is what makes the difference between a qualified lead and a lost one.
- Consumer voice assistants: the prosody jump closes more of the uncanny-valley gap — matters more than benchmarks for retention.
- Telephony / SIP deployments: pairs nicely with OpenAI’s recent SIP GeoIP routing and DTMF event support in the same Realtime update wave.
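For builders wiring up tool calling in these scenarios, tools are declared on the Realtime session. The sketch below only constructs the `session.update` event payload; the tool itself (`lookup_ticket`) is a made-up example, and sending the event over the WebSocket plus handling the resulting function-call events is omitted:

```python
import json

def ticket_tool_session_update():
    """Build a Realtime `session.update` client event declaring one tool.

    `lookup_ticket` is a hypothetical support tool; the shape follows
    the Realtime API's client-event format, where tools are flat
    function definitions on the session object.
    """
    return {
        "type": "session.update",
        "session": {
            "tools": [{
                "type": "function",
                "name": "lookup_ticket",
                "description": "Fetch a support ticket by its ID.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "ticket_id": {"type": "string"},
                    },
                    "required": ["ticket_id"],
                },
            }],
            "tool_choice": "auto",
        },
    }

event = ticket_tool_session_update()
payload = json.dumps(event)  # what you'd send over the WebSocket
```

The "better tool calling" claim cashes out at exactly this boundary: whether the model emits a well-formed `lookup_ticket` call mid-conversation instead of hallucinating arguments or talking past the tool.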
## Limitations & pricing
- Voice still trails text + reasoning agents on realistic-audio τ³-Bench runs — closer, not equal.
- $32 / $64 per 1M audio tokens is not cheap. Use gpt-realtime-mini or cached-input pricing ($0.40/1M) for cost-sensitive workloads.
- Custom Voices are still gated to eligible customers via sales.
- Sierra’s voice methodology is ~5 weeks old; expect the board to reshuffle as Gemini Live and Grok Voice refresh.
## What’s next
Two things to watch. First, competitor response: Gemini Live and Grok Voice both have active roadmaps; a 20-point gap is an invitation, not a moat. Second, Sierra’s per-model scorecards: the aggregate ranges published in March are going to get replaced with individual numbers, and those will tell us whether the gap is uniform across domains or concentrated in (e.g.) telecom-style auth flows.
For builders: if you’ve been waiting for voice agents to cross the reliability threshold before shipping to production, this is the release that moves the needle. The right move is to re-run your own eval harness against gpt-realtime-1.5 this week — especially on tool-calling paths that failed on the December snapshot.
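A "re-run your harness" pass does not need to be elaborate; the core is replaying the tool-calling cases that failed on the December snapshot and diffing pass rates. A minimal scoring loop, with the case format and the `run_case` callable as stand-ins for your own transcripts and Realtime client code:

```python
from typing import Callable

def score_tool_calls(cases: list[dict],
                     run_case: Callable[[str], dict]) -> float:
    """Fraction of cases where the model called the expected tool
    with the expected arguments.

    Each case carries the prompt/audio under `input` plus the
    `expected_tool` and `expected_args` the agent must produce;
    run_case is your Realtime client wrapper returning the actual
    tool call as {"tool": ..., "args": ...}.
    """
    passed = 0
    for case in cases:
        actual = run_case(case["input"])
        if (actual.get("tool") == case["expected_tool"]
                and actual.get("args") == case["expected_args"]):
            passed += 1
    return passed / len(cases) if cases else 0.0

# Stub runner standing in for a real gpt-realtime-1.5 session:
stub = lambda _inp: {"tool": "lookup_ticket", "args": {"ticket_id": "T-42"}}
cases = [{"input": "...", "expected_tool": "lookup_ticket",
          "expected_args": {"ticket_id": "T-42"}}]
print(score_tool_calls(cases, stub))  # → 1.0
```

Run the same case set against both model snapshots and the delta is your own private version of the τ³-Bench gap, on your own domains.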
Source: @veggie_eric on X, OpenAI developers blog, Sierra τ³-Bench.
