
Zoom Video Communications, the company best known for keeping remote workers connected during the pandemic, announced last week that it had achieved the highest score ever recorded on one of the most demanding tests of artificial intelligence — a claim that sent waves of surprise, skepticism and genuine curiosity through the technology industry.
The San Jose-based company said its AI system scored 48.1 percent on Humanity’s Last Exam, a benchmark assembled by subject-matter experts around the world and designed to stump even the most advanced AI models. The result edges out Google’s Gemini 3 Pro, which held the previous record of 45.8 percent.
"Zoom achieved a new state-of-the-art result on the challenging Humanities Last Exam full-set benchmark, scoring 48.1%, representing a substantial improvement of 2.3% compared to the previous SOTA result." Xuedong Huang, Zoom’s chief technology officer, wrote in a blog post.
The announcement raises a tantalizing question that has troubled AI watchers for days: How did a video conferencing company – with no public history of training large language models – suddenly overtake Google, OpenAI and Anthropic on a benchmark designed to measure the limits of machine intelligence?
The answer shows where AI is headed and also speaks to Zoom’s own technological ambitions. And depending on who you ask, it’s either an ingenious demonstration of practical engineering or an empty claim that takes credit for the work of others.
How Zoom built an AI traffic controller instead of training its own models
Zoom did not train its own large language models. Instead, the company developed what it calls a "federated AI approach": a system that routes queries to multiple existing models from OpenAI, Google, and Anthropic, then uses proprietary software to select, combine, and refine their outputs.
At the center of this system sits what Zoom calls its "Z-Scorer," a mechanism that evaluates responses from different models and selects the best one for any given task. The company pairs the scorer with what it describes as an "investigation-verification-federation" strategy, an agentic workflow that balances exploratory reasoning with verification across multiple AI systems.
"Our federated approach combines Zoom’s own small language model with advanced open-source and closed-source models," Huang wrote. framework "Organizes diverse models to generate, challenge, and refine arguments through dialectical collaboration."
In simple terms: Zoom built a sophisticated traffic controller for AI rather than the AI itself.
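For readers who want the shape of that idea in code, here is a minimal sketch of a federated routing layer in the spirit of what Zoom describes. The model clients, the scoring function and every name below are illustrative assumptions, not Zoom’s actual Z-Scorer.

```python
# Hypothetical sketch of a federated AI router: fan a query out to several
# providers, score each response, and return the best one. Names and the
# scoring heuristic are assumptions for illustration, not Zoom's code.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Candidate:
    model: str    # which provider produced the answer
    answer: str   # the model's response text
    score: float  # quality estimate assigned by the scorer

def federated_answer(
    query: str,
    models: Dict[str, Callable[[str], str]],  # provider name -> API call
    scorer: Callable[[str, str], float],      # (query, answer) -> quality
) -> Candidate:
    """Query every model, score every response, keep the winner."""
    candidates = [
        Candidate(model=name, answer=call(query), score=0.0)
        for name, call in models.items()
    ]
    for c in candidates:
        c.score = scorer(query, c.answer)
    return max(candidates, key=lambda c: c.score)

# Toy usage with stub models and a length-based stand-in for a real scorer.
if __name__ == "__main__":
    stubs = {
        "model_a": lambda q: "short answer",
        "model_b": lambda q: "a longer, more detailed answer",
    }
    best = federated_answer("What is HLE?", stubs, lambda q, a: float(len(a)))
    print(best.model, "->", best.answer)
```

In the workflow Zoom describes, selection is only the first step; models also challenge and refine one another’s outputs before a final answer is chosen.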
That difference matters a great deal in an industry where bragging rights – and valuations worth billions – often depend on who can claim the most capable model. Major AI labs spend millions of dollars training frontier systems on huge computing clusters. Zoom’s achievement, by contrast, appears to rest on clever integration of those existing systems.
Why AI researchers are divided on what counts as real innovation
The reaction from the AI community was swift and sharply divided.
Max Rumpf, an AI engineer who says he has trained cutting-edge language models, posted a scathing criticism on social media. "Zoom runs parallel API calls to Gemini, GPT, Claude, and others, and ekes out a small improvement on a benchmark that provides no value to its customers," he wrote. "Then they claim SOTA."
Rumpf did not reject the technical approach itself. Using different models for different tasks, he conceded, has real merit: "Actually this is quite smart and most applications should do this." He pointed to AI customer service company Sierra as an example of the multi-model strategy executed effectively.
His objection was more specific: "They did not train the model, but obscured this fact in the tweet. A sense of the injustice of taking credit for others’ work is deeply ingrained in people."
But other observers saw the achievement differently. Hongcheng Zhu, a developer, offered a more measured assessment: "To top AI evals, you’ll probably need model federation, like Zoom did. An analogy is that every Kaggle competitor knows that to win the competition you have to ensemble models."
The comparison to Kaggle – the competitive data science platform where combining multiple models is standard practice among winning teams – reframes Zoom’s approach as industry best practice rather than sleight of hand. Academic research has long established that ensemble methods routinely outperform individual models.
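The ensembling idea Zhu invokes fits in a few lines. This is a generic majority-vote ensemble of the kind common in Kaggle competitions, not anything Zoom has disclosed; the model outputs are invented.

```python
# Generic majority-vote ensembling, the simplest Kaggle-style combination:
# ask several models the same question and keep the most common answer.
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties break toward earlier answers."""
    return Counter(answers).most_common(1)[0][0]

# Invented outputs from three hypothetical models:
print(majority_vote(["42", "42", "41"]))  # -> "42"
```

Routing with a scorer, as in Zoom’s described system, is a more selective cousin of the same idea: rather than blending every answer, pick the single response judged best.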
Still, the debate exposed a fault line in how the industry understands progress. Ryan Priem, founder of Exoria AI, was dismissive: "Zoom is just building a harness around another LLM and reporting on it. It’s just noise." Another commentator highlighted how unexpected the news was: "Video conferencing app Zoom developing a SOTA model that achieves 48% on HLE was not on my bingo card."
Perhaps the sharpest criticism concerned priorities. Rumpf argued that Zoom could have directed its resources toward problems its customers actually face. "Retrieval over call transcripts has not been ‘solved’ by SOTA LLMs," he wrote. "I think Zoom users will care about this more than HLE."
A Microsoft veteran has staked his reputation on a different kind of AI
If Zoom’s benchmark result seemed to come out of nowhere, its chief technology officer did not.
Xuedong Huang joined Zoom from Microsoft, where he spent decades building the company’s AI capabilities. He founded Microsoft’s speech technology group in 1993 and led teams that achieved human parity in speech recognition, machine translation, natural language understanding, and computer vision.
Huang holds a Ph.D. in electrical engineering from the University of Edinburgh. He is an elected member of the National Academy of Engineering and the American Academy of Arts and Sciences, as well as a Fellow of both the IEEE and the ACM. His credentials place him among the most accomplished AI executives in the industry.
His presence at Zoom signals that the company has serious AI ambitions, even if its methods differ from those of the headline-grabbing research labs. In his tweet celebrating the benchmark result, Huang touted the achievement as validation of Zoom’s strategy: "We have unlocked strong capabilities in exploration, reasoning, and multi-model collaboration, overcoming the performance limitations of any single model."
That last phrase – "overcoming the performance limitations of any single model" – may be the most important. Huang is not claiming that Zoom has created a better model. He is claiming that Zoom has created a better system for using models.
Inside a test designed to stump the world’s smartest machines
The benchmark at the center of this controversy, Humanity’s Last Exam, was designed to be exceptionally difficult. Unlike earlier tests, which AI systems learned to game through pattern matching, HLE presents problems that require genuine understanding, multi-step reasoning, and synthesis of information across complex domains.
The exam draws its questions from experts around the world, covering areas ranging from advanced mathematics to philosophy to specialized scientific knowledge. A score of 48.1 percent may seem unimpressive to anyone accustomed to school grading curves, but in HLE terms it represents the current limit of machine performance.
"This benchmark was developed by subject-matter experts globally and has become a key metric for measuring AI’s progress toward human-level performance on challenging intellectual tasks." Zoom’s announcement was noted.
The company’s 2.3 percentage point improvement over Google’s previous best may look modest in isolation. But in competitive benchmarking, where gains often come in fractions of a percent, a jump of that size attracts attention.
What Zoom’s approach tells us about the future of enterprise AI
Zoom’s approach has implications that extend far beyond benchmark leaderboards. The company is signaling an approach to enterprise AI that is fundamentally different from the model-centric strategies adopted by OpenAI, Anthropic and Google.
Rather than betting everything on building a single most capable model, Zoom is positioning itself as an orchestration layer — a company that can integrate the best capabilities from multiple providers and deliver them through the products that businesses already use every day.
This strategy hedges against a deep uncertainty in the AI market: no one knows which model will be best next month, let alone next year. By building infrastructure that can swap between providers, Zoom theoretically avoids vendor lock-in while giving customers the best available AI for any given task.
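A minimal sketch of that orchestration idea, under stated assumptions: application code targets one interface, so the provider behind each task can be re-pointed as the "best available" model changes. The class names, tasks and stubbed responses below are hypothetical.

```python
# Hypothetical provider-agnostic orchestration layer. Each provider hides
# its API behind a common interface; a routing table maps tasks to whichever
# provider currently performs best, so swapping vendors is a one-line change.
from abc import ABC, abstractmethod

class ChatProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class ProviderA(ChatProvider):
    def complete(self, prompt: str) -> str:
        return f"[provider_a] {prompt}"  # stub standing in for a real API call

class ProviderB(ChatProvider):
    def complete(self, prompt: str) -> str:
        return f"[provider_b] {prompt}"  # stub standing in for a real API call

# Re-point a task here when a different provider takes the lead.
ROUTES: dict[str, ChatProvider] = {
    "meeting_summary": ProviderA(),
    "action_items": ProviderB(),
}

def run(task: str, prompt: str) -> str:
    return ROUTES[task].complete(prompt)

print(run("meeting_summary", "Summarize today's standup."))
```

Swapping vendors then becomes an edit to the routing table rather than a rewrite of product code.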
The announcement of OpenAI’s GPT-5.2 the next day underlined this dynamic. OpenAI’s own communications named Zoom as a partner that had evaluated the new model’s performance, reporting "measurable gains" across its AI workloads. Zoom, in other words, is a customer of the frontier labs and now a competitor on their benchmarks – using their own technology.
This arrangement may prove sustainable. Major model providers have every incentive to sell API access widely, even to companies that aggregate their outputs. The more interesting question is whether Zoom’s orchestration capabilities constitute genuine intellectual property or merely sophisticated prompt engineering that others can replicate.
The real test comes when Zoom’s 300 million users start asking questions
Zoom titled the industry-relations section of its announcement "A collaborative future," and Huang struck a gracious tone throughout. "The future of AI is collaborative, not competitive," he wrote. "By combining the best innovations from across the industry with our own research breakthroughs, we create solutions that are much greater than the sum of their parts."
This framing positions Zoom as a pragmatic integrator, bringing together the industry’s best work for the benefit of enterprise customers. Critics see something else: a company claiming the prestige of an AI laboratory without doing the fundamental research that earns it.
The debate will likely be settled not by leaderboards but by products. When AI Companion 3.0 reaches Zoom’s millions of users in the coming months, they will render their verdict not on benchmarks they have never heard of, but on whether meeting summaries capture what actually mattered, whether action items make sense, and whether the AI saves or wastes their time.
Ultimately, Zoom’s most provocative claim may not be that it topped a benchmark. The deeper argument may be that in the age of AI, the best model is not the one you build – it’s the one you know how to use.