Scale AI launches Voice Showdown, the first real-world benchmark for voice AI — and the results are humbling for some top models

Voice AI is advancing faster than the tools we use to measure it. Every major AI lab – OpenAI, Google DeepMind, Anthropic, xAI – is racing to ship voice models capable of carrying on natural, real-time conversations.

But the benchmarks used to evaluate those models are still largely built on synthetic speech, English-only prompts, and scripted test sets that bear little resemblance to the way people actually talk.

Scale AI, the big data annotation startup whose founder was tapped by Meta last year to lead its Superintelligence Lab, is tackling the problem: today it launches Voice Showdown, which it calls the first global, preference-based arena designed to benchmark voice AI through the lens of real human interaction.

The product offers users a straightforward trade: free access to the world's leading frontier models. Through Scale's Chatlab platform, users can interact with high-end models, which typically require subscriptions of more than $20 per month, at no cost. In exchange, they occasionally take part in blind, head-to-head "battles," choosing which of two anonymous voice models delivers the better experience. Those votes feed what Scale bills as the industry's most authentic human-preference leaderboard for voice AI models.

"Voice AI is actually the fastest growing area in AI at the moment," said Jenny Gu, Showdown’s product manager at Scale AI. "But the way we evaluate voice models hasn’t kept up."

The results, derived from thousands of spontaneous voice conversations in over 60 languages, reveal capability gaps that other benchmarks have consistently missed.

How does Scale’s Voice Showdown work?

Voice Showdown is built on Chatlab, Scale's model-agnostic chat platform, where users can interact for free with whichever frontier AI model they choose within the same app. The platform is available to Scale's global community of more than 500,000 annotators, approximately 300,000 of whom have submitted at least one prompt. Scale is also opening a public waiting list today.

The evaluation mechanism is simple: while a user is having a natural voice conversation with a model, the system occasionally – on fewer than 5% of all voice prompts – triggers a blind, side-by-side comparison. The same prompt is sent to a second, unnamed model, and the user chooses which response they prefer.
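Scale hasn't published Chatlab's internals, but the mechanism as described maps onto a simple sampling step. Here is a minimal Python sketch; every function and parameter name below is hypothetical:

```python
import random

BATTLE_RATE = 0.05  # per the article, battles trigger on fewer than 5% of voice prompts

def handle_voice_turn(prompt_audio, current_model, other_models,
                      get_response, ask_user_preference):
    """Hypothetical sketch of the blind comparison step, not Scale's actual code."""
    if random.random() >= BATTLE_RATE:
        # Normal turn: only the user's current model answers.
        return get_response(current_model, prompt_audio), None

    # Battle turn: the same spoken prompt also goes to a second, unnamed model.
    rival = random.choice([m for m in other_models if m != current_model])
    response_a = get_response(current_model, prompt_audio)
    response_b = get_response(rival, prompt_audio)

    # The user picks the response they prefer without seeing model names.
    winner = ask_user_preference(response_a, response_b)  # returns "a" or "b"
    vote = {"pair": (current_model, rival), "winner": winner}
    return (response_a if winner == "a" else response_b), vote
```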

This design addresses three problems that plague existing voice benchmarks.

First, each prompt comes from actual human speech – complete with accents, background noise, half-finished sentences and conversational fillers – rather than synthesized audio generated from text.

Second, the platform spans more than 60 languages across six continents, with more than a third of battles taking place in non-English languages, including Spanish, Arabic, Japanese, Portuguese, Hindi and French.

Third, because battles occur during users' actual daily interactions, 81% of the prompts are conversational or open-ended – questions without a single correct answer. That defies automated scoring and makes human preference the only reliable yardstick.

Voice Showdown currently runs two evaluation modes: Dictate (the user speaks, the model responds in text) and Speech-to-Speech, or S2S (the user speaks, the model talks back). A third mode – full duplex, which captures free-flowing, real-time conversation – is in development.

Incentive-aligned voting

One design detail differentiates Voice Showdown from LMArena (formerly Chatbot Arena), the text benchmark it most closely resembles. Critics have noted that LMArena users sometimes cast throwaway votes with little stake in the outcome. Voice Showdown addresses this directly: after a user votes for their preferred model, the app switches them to that model for the rest of the conversation. If you voted for GPT-4o Audio over Gemini, you are now talking to GPT-4o Audio. Tying the outcome to the user's own preference discourages careless or dishonest voting.

The system also controls for confounds that could contaminate comparisons: both model responses begin to stream simultaneously (eliminating speed bias), voice genders in both options are matched (eliminating gender preference bias), and neither model is identified by name during voting.
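Taken together, the battle design reads like a short checklist: match voices, stream both answers at once, hide the names, then hand the conversation to the winner. A minimal sketch of that flow, with all function and parameter names assumed rather than taken from Scale:

```python
def run_blind_battle(prompt_audio, model_a, model_b,
                     pick_voice, start_stream, present_blind):
    """Illustrative only; not Scale's actual implementation."""
    # Match voice genders so a gender preference can't decide the vote.
    voice_a = pick_voice(model_a, gender="female")
    voice_b = pick_voice(model_b, gender="female")

    # Both responses begin streaming at the same time (no speed bias),
    # and neither is labeled with a model name.
    stream_a = start_stream(model_a, prompt_audio, voice_a)
    stream_b = start_stream(model_b, prompt_audio, voice_b)
    winner = present_blind(stream_a, stream_b)  # user taps "a" or "b"

    # Incentive alignment: the conversation continues with the chosen model.
    return model_a if winner == "a" else model_b
```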

Every enterprise decision maker should pay attention to the new voice AI leaderboard

Voice Showdown launches with 11 frontier models evaluated across 52 model-voice pairs through March 18, 2026. Not all models support both evaluation modes – the Dictate leaderboard includes eight models, while S2S includes six.

Dictate leaderboard (speech-in, text-out)

In this mode, users provide a verbal prompt and evaluate two side-by-side text responses. Here are the baseline scores:

  1. Gemini 3 Pro (1073)

  2. Gemini 3 Flash (1068)

  3. GPT-4o Audio (1019)

  4. Qwen 3 Omni (1000)

  5. Voxtral Small (925)

  6. Gemma 3n (918)

  7. GPT Realtime (875)

  8. Phi-4 Multimodal (729)

Comment: Gemini 3 Pro and Gemini 3 Flash are statistically tied for the top spot.

Speech-to-Speech (S2S) Leaderboard

In this mode, users talk to the model and evaluate two competing audio responses. Here are the baseline scores:

  1. Gemini 2.5 Flash Audio (1060)

  2. GPT-4o Audio (1059)

  3. Grok Voice (1024)

  4. Qwen 3 Omni (1000)

  5. GPT Realtime (962)

  6. GPT Realtime 1.5 (920)

Comment: Gemini 2.5 Flash Audio and GPT-4o Audio are statistically tied for the top rank in the baseline evaluation.

The Dictate ranking is led by Google's Gemini 3 Pro and Gemini 3 Flash, which are statistically tied at #1 with Elo scores of roughly 1,043-1,044 after style control.

GPT-4o Audio takes a clear third place. Open-weight models including Gemma 3n, Voxtral Small and Phi-4 Multimodal trail well behind.

Speech-to-speech (S2S) rankings show strong competition at the top, with Gemini 2.5 Flash Audio and GPT-4o Audio statistically tied at #1 in the baseline rankings.

After adjusting for response length and formatting – factors that can inflate perceived quality – GPT-4o Audio pulls ahead (1,102 Elo versus 1,075 for Gemini 2.5 Flash Audio).

Under style control, Grok Voice moves up to second place at 1,093, suggesting its raw #3 ranking understates its actual performance.
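Scale hasn't detailed its rating pipeline, but arena-style leaderboards like this one typically convert pairwise votes into Elo scores. A minimal sketch of a standard Elo update, offered as an assumption rather than a description of Scale's exact method:

```python
def update_elo(rating_a, rating_b, winner, k=32):
    """One standard Elo update after a single battle; winner is 'a' or 'b'."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if winner == "a" else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Ratings are typically initialized at a common baseline (e.g., 1,000) and drift
# apart as votes accumulate. The "style control" variant discussed above further
# adjusts for response length and formatting effects.
```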

Qwen 3 Omni, the open-weight model from Alibaba's Qwen team, punches above its name recognition – ranking fourth in both modes, ahead of several higher-profile entries.

"When people come, they go for the big names," Gu noted. "But for preference, lesser-known models like Quen really take the lead."

Real-world preference data reveals surprises

Beyond the rankings, the real value of Voice Showdown is in its failure diagnostics – and they paint a more complicated picture of voice AI than most leaderboards do.

Multilingual gaps are worse than you think

The biggest differentiator across models is language robustness. In Dictate, the Gemini 3 models lead in essentially every language tested.

In S2S, the winner depends largely on which language is being spoken: GPT-4o Audio leads in Arabic and Turkish; Gemini 2.5 Flash Audio is strongest in French; Grok Voice comes out on top in Japanese and Portuguese.

But a more worrying finding is how often some models simply stop responding in the user’s language.

GPT Realtime 1.5 – OpenAI's new real-time voice model – responds in English to non-English prompts about 20% of the time, even in high-resource, officially supported languages like Hindi, Spanish, and Turkish.

Its predecessor, GPT Realtime, mismatches at about half that rate (~10%). Gemini 2.5 Flash Audio and GPT-4o Audio sit at ~7%.

The phenomenon shows up in more than one form: some models drift a non-English conversation into English partway through, while others simply misunderstand a prompt and produce an unrelated response in the wrong language altogether.
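The ~20% and ~7% figures above are, in effect, language-mismatch rates. A hedged sketch of how such a rate could be computed, with `detect_language` standing in for whatever language-identification step Scale actually uses (the article doesn't say):

```python
def language_mismatch_rate(turns, detect_language):
    """turns: iterable of (prompt_text, response_text) pairs from non-English prompts.

    detect_language is a hypothetical stand-in for any language-ID function.
    """
    total = mismatches = 0
    for prompt_text, response_text in turns:
        total += 1
        # A mismatch is counted whenever the response language differs from
        # the language the user actually spoke.
        if detect_language(response_text) != detect_language(prompt_text):
            mismatches += 1
    return mismatches / total if total else 0.0
```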

Verbatim user feedback captures the frustration: "I said I have an interview with Quest Management today and instead of answering, it gave me information about ‘risk management’."

"GPT Realtime 1.5 thought I was speaking incoherently and recommended mental health support, while Quen 3 Omni correctly identified that I was speaking the Nigerian vernacular."

Existing benchmarks miss this: they are built on synthetic speech optimized for clean acoustic conditions, and they are rarely multilingual. Real speakers in real environments – with background noise, mild accents, and regional dialects – stress speech understanding in ways laboratory conditions never anticipate.

Voice selection is more than an aesthetic choice

Voice Showdown evaluates models not just at the model level but at the individual voice level – and the variation within a model’s voice catalog is striking.

For one unnamed model in the study, the best-performing voice wins battles at a rate 30 percentage points higher than the worst-performing voice from the same underlying model. Both voices share the same underlying reasoning and generation; the difference is purely in the audio presentation.

Battles are mostly won or lost on audio understanding and content completeness – whether the model heard you correctly and responded fully. But at the voice-selection level, speech quality remains a decisive factor, especially when the models are otherwise comparable. "Voice directly determines how users evaluate a conversation," Gu said.

Models degrade in conversation

Most benchmarks test a single turn. Voice Showdown tests how models hold up in extended conversations – and the results are not flattering.

At turn 1, content quality accounts for 23% of model failures. By turn 11 and beyond, it becomes the primary failure mode at 43%. As conversations progress, most models see their win rates decline, struggling to maintain coherence across many exchanges.

GPT Realtime variants are an exception, showing modest improvements on later turns – consistent with their known strengths on long contexts, and their documented weakness on the short, noisy utterances that dominate early conversations.

Prompt length shows a complementary pattern: short prompts (under 10 seconds) are dominated by audio comprehension failures (38%), while long prompts (over 40 seconds) shift the primary failure mode toward content quality (31%). Shorter clips give models less acoustic context to parse; long requests are understood but harder to answer well.

Why do some voice AI models lose?

After each S2S comparison, users tag why they preferred one response over the other across three axes: audio understanding, content quality, and speech output. Failure signatures vary meaningfully by model.

Qwen 3 Omni's shortcomings cluster around speech output – its reasoning is competitive, but users are put off by how it sounds. GPT Realtime 1.5's losses are dominated by audio understanding failures (51%), consistent with its language-switching behavior on challenging prompts. Grok Voice's failures are spread more evenly across all three axes, pointing to no single glaring weakness, but no particular strength either.
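A short sketch of how those per-model failure signatures could be tallied from the tagged votes; the data structure and field names below are assumptions, not Scale's schema:

```python
from collections import Counter

# The three failure axes described above.
AXES = ("audio understanding", "content quality", "speech output")

def failure_signatures(lost_battles):
    """lost_battles: list of dicts like {"model": "model-x", "reason": one of AXES}."""
    per_model = {}
    for battle in lost_battles:
        per_model.setdefault(battle["model"], Counter())[battle["reason"]] += 1
    # Turn raw counts into per-model percentages, e.g. the 51% of GPT Realtime 1.5
    # losses attributed to audio understanding in the article.
    return {
        model: {axis: round(100 * counts[axis] / sum(counts.values())) for axis in AXES}
        for model, counts in per_model.items()
    }
```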

What comes next

The current leaderboard covers turn-based interactions – you speak, the model responds, repeat. But actual voice conversations don't work that way. People interrupt, change direction mid-sentence, and talk over each other.

Scale says full-duplex evaluation – designed to capture these real-time dynamics through human preference rather than scripted scenarios or automated metrics – is coming in the next showdown. No existing benchmark captures full-duplex interactions through organic human preference data.

The leaderboard is live at scale.com/showdown. A public waiting list to join Chatlab and vote on comparisons opens today, with users receiving free access to frontier voice models including GPT-4o, Gemini, and Grok in exchange for occasional preference votes.



