## TL;DR
OpenAI's newest speech-to-speech model — gpt-realtime-1.5, released to the Realtime API on Feb 23, 2026 — now sits 20+ points ahead of the nearest competitor on Sierra’s τ³-Bench voice leaderboard, per a public claim from OpenAI researcher Eric (@veggie_eric). The jump from the December “OG” gpt-realtime is driven by significantly better realism and tool calling — exactly the two axes where voice agents collapse under realistic audio conditions.
## What’s new
Sierra’s τ³-Bench (launched Mar 18, 2026) is the first serious full-duplex voice-agent benchmark. It runs customer-support simulations over real audio — with background noise, interruptions, and turn-taking — against the actual Realtime APIs from OpenAI, Google Gemini Live, and xAI Grok Voice. A 20+ point gap on that board is not a press-release flex; it’s the biggest single-release jump reported on the leaderboard to date.
Eric’s framing: “Pretty massive upgrade from our OG model from December, with significantly improved realism and tool calling capabilities.” Translation: the two hardest problems in production voice agents — sounding human and reliably calling APIs mid-conversation — both moved at the same time.
Worth noting why this is a big deal architecturally. Realism and tool calling usually trade off against each other in speech-to-speech models. More expressive audio generation tends to burn compute that could have gone to structured reasoning; tighter tool-calling traces tend to flatten prosody. Shipping both in the same release suggests OpenAI found architectural headroom — likely in the decoder and function-routing layers — rather than pulling one lever at the expense of the other.
## Why it matters
Voice agents have been stuck in an awkward middle ground. Sierra’s own numbers tell the story: text agents with reasoning hit ~85% task completion, plain text agents hit ~54%, but voice agents under realistic audio were stuck at 26–38%. The bottleneck wasn’t raw intelligence — it was authentication. Mishear a name or email once, and every downstream tool call fails.
A 20+ point leap directly attacks that failure mode. Better realism means fewer mishears. Better tool calling means fewer dropped CRM writes, fewer broken bookings, fewer failed payment confirmations. That’s the difference between a demo and a product.
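One common mitigation for the mishear-then-cascade failure mode described above is to validate captured identifiers before issuing any tool call, and re-prompt the caller when validation fails. A minimal sketch of that gate follows; the function name, field set, and checks are illustrative, not part of any OpenAI API:

```python
import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def gate_auth_fields(heard: dict) -> tuple[bool, list[str]]:
    """Return (ok, fields_to_reprompt) for ASR-captured auth fields.

    Downstream tool calls (CRM lookups, bookings, payments) only fire
    once every required field passes a cheap sanity check; otherwise
    the agent reads the value back and asks the caller to confirm.
    """
    problems = []
    if not EMAIL_RE.match(heard.get("email", "")):
        problems.append("email")
    if len(heard.get("name", "").split()) < 2:  # want first + last
        problems.append("name")
    return (not problems, problems)

ok, redo = gate_auth_fields({"name": "Ada Lovelace",
                             "email": "ada@example.com"})
# ok is True, redo is []
```

The point of the gate is cheapness: a regex and a word count cost nothing compared to a failed payment confirmation three tool calls later.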
## Technical facts
| Property | Value |
|---|---|
| Model ID | gpt-realtime-1.5 |
| Released | Feb 23, 2026 (Realtime API) |
| Prior baseline | gpt-realtime (Dec 2025) & gpt-realtime-mini-2025-12-15 |
| Sierra τ³-Bench voice gap | 20+ pts ahead of nearest competitor |
| Context window | 32,000 tokens |
| Max output | 4,096 tokens |
| Modalities | Text, Image (in), Audio (in/out) |
| Function calling | Supported |
| Audio input | $32 / 1M tokens |
| Audio output | $64 / 1M tokens |
| Cached audio input | $0.40 / 1M tokens |
For context on the trajectory: the Dec 15, 2025 gpt-realtime-mini snapshot alone delivered +18.6 percentage points in instruction-following and +12.9 pp in tool-calling accuracy over its predecessor. gpt-realtime-1.5 is the next rung on that same ladder — now on the full-size model.
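Using the per-token prices from the table above, a rough per-call cost is easy to work out. The tokens-per-minute figure below is an assumption for illustration (audio token rates vary by codec and model), not an OpenAI-published number:

```python
def call_cost_usd(in_minutes, out_minutes, cached_fraction=0.0,
                  tokens_per_minute=800):
    """Estimate the audio cost of one call at gpt-realtime-1.5 list prices.

    tokens_per_minute is an illustrative assumption; cached_fraction is
    the share of input audio billed at the cached rate ($0.40/1M)
    instead of the full rate ($32/1M). Output audio is $64/1M.
    """
    in_tok = in_minutes * tokens_per_minute
    out_tok = out_minutes * tokens_per_minute
    cached_tok = in_tok * cached_fraction
    fresh_tok = in_tok - cached_tok
    return (fresh_tok * 32 + cached_tok * 0.40 + out_tok * 64) / 1_000_000

# A 5-minute call, half listening / half speaking, no caching:
print(f"${call_cost_usd(2.5, 2.5):.4f}")  # → $0.1920
```

Under these assumptions, a high-volume support line doing 10,000 such calls a day lands near $2,000/day in audio tokens alone, which is why the cached-input rate matters so much for repeated system prompts and greetings.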
## Comparison: the voice-provider field
| Provider | Model | τ³-Bench voice position |
|---|---|---|
| OpenAI | gpt-realtime-1.5 | Leader (20+ pt gap) |
| Google | Gemini Live | Within the 26–38% realistic-voice band |
| xAI | Grok Voice | Within the 26–38% realistic-voice band |
Before gpt-realtime-1.5, Sierra described the three providers as “closely matched.” That framing is now obsolete.
## Use cases that just got unblocked
- Customer support voice agents: the auth-and-intent phase was the #1 bottleneck. More realism = fewer mishears; better tool-calling = fewer botched ticket lookups.
- Outbound voice (sales, scheduling, surveys): tool reliability mid-call is what makes the difference between a qualified lead and a lost one.
- Consumer voice assistants: the prosody jump closes more of the uncanny-valley gap — matters more than benchmarks for retention.
- Telephony / SIP deployments: pairs nicely with OpenAI’s recent SIP GeoIP routing and DTMF event support in the same Realtime update wave.
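For builders wiring up tool calling in these scenarios, tools are declared on the Realtime session. The sketch below only constructs the `session.update` event payload; the tool itself (`lookup_ticket`) is a made-up example, and sending the event over the WebSocket plus handling the resulting function-call events is omitted:

```python
import json

def ticket_tool_session_update():
    """Build a Realtime `session.update` client event declaring one tool.

    `lookup_ticket` is a hypothetical support tool; the shape follows
    the Realtime API's client-event format, where tools are flat
    function definitions on the session object.
    """
    return {
        "type": "session.update",
        "session": {
            "tools": [{
                "type": "function",
                "name": "lookup_ticket",
                "description": "Fetch a support ticket by its ID.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "ticket_id": {"type": "string"},
                    },
                    "required": ["ticket_id"],
                },
            }],
            "tool_choice": "auto",
        },
    }

event = ticket_tool_session_update()
payload = json.dumps(event)  # what you'd send over the WebSocket
```

The "better tool calling" claim cashes out at exactly this boundary: whether the model emits a well-formed `lookup_ticket` call mid-conversation instead of hallucinating arguments or talking past the tool.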
## Limitations & pricing
- Voice still trails text + reasoning agents on realistic-audio τ³-Bench runs — closer, not equal.
- $32 / $64 per 1M audio tokens is not cheap. Use gpt-realtime-mini or cached-input pricing ($0.40/1M) for cost-sensitive workloads.
- Custom Voices are still gated to eligible customers via sales.
- Sierra’s voice methodology is ~5 weeks old; expect the board to reshuffle as Gemini Live and Grok Voice refresh.
## What’s next
Two things to watch. First, competitor response: Gemini Live and Grok Voice both have active roadmaps; a 20-point gap is an invitation, not a moat. Second, Sierra’s per-model scorecards: the aggregate ranges published in March are going to get replaced with individual numbers, and those will tell us whether the gap is uniform across domains or concentrated in (e.g.) telecom-style auth flows.
For builders: if you’ve been waiting for voice agents to cross the reliability threshold before shipping to production, this is the release that moves the needle. The right move is to re-run your own eval harness against gpt-realtime-1.5 this week — especially on tool-calling paths that failed on the December snapshot.
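A "re-run your harness" pass does not need to be elaborate; the core is replaying the tool-calling cases that failed on the December snapshot and diffing pass rates. A minimal scoring loop, with the case format and the `run_case` callable as stand-ins for your own transcripts and Realtime client code:

```python
from typing import Callable

def score_tool_calls(cases: list[dict],
                     run_case: Callable[[str], dict]) -> float:
    """Fraction of cases where the model called the expected tool
    with the expected arguments.

    Each case carries the prompt/audio under `input` plus the
    `expected_tool` and `expected_args` the agent must produce;
    run_case is your Realtime client wrapper returning the actual
    tool call as {"tool": ..., "args": ...}.
    """
    passed = 0
    for case in cases:
        actual = run_case(case["input"])
        if (actual.get("tool") == case["expected_tool"]
                and actual.get("args") == case["expected_args"]):
            passed += 1
    return passed / len(cases) if cases else 0.0

# Stub runner standing in for a real gpt-realtime-1.5 session:
stub = lambda _inp: {"tool": "lookup_ticket", "args": {"ticket_id": "T-42"}}
cases = [{"input": "...", "expected_tool": "lookup_ticket",
          "expected_args": {"ticket_id": "T-42"}}]
print(score_tool_calls(cases, stub))  # → 1.0
```

Run the same case set against both model snapshots and the delta is your own private version of the τ³-Bench gap, on your own domains.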
Source: @veggie_eric on X, OpenAI developers blog, Sierra τ³-Bench.
