Amazon’s Bet That AI Benchmarks Don’t Matter

This is an excerpt from Alex Heath’s Sources, a newsletter about AI and the tech industry, syndicated just for The Verge subscribers once a week.

Amazon’s AI chief has a message for model benchmark obsessives: Stop looking at the leaderboard.

“I want real-world usability. None of these benchmarks are realistic,” Rohit Prasad, Amazon’s SVP of AGI, told me ahead of today’s announcements at AWS re:Invent in Las Vegas. “The only way to do real benchmarking is to have everyone conform to the same training data and stop evaluations altogether. That’s not happening. The evaluations are obviously noisy, and they’re not showing the real power of these models.”

It’s a contrarian stance when every other AI lab is quick to boast about how its new models climb the leaderboards. It’s also convenient for Amazon, given that the previous version of Nova, its flagship model, was ranked 79th on LMArena when Prasad and I spoke last week. Still, dismissing the benchmarks only works if Amazon can present a different story about what progress looks like.

The centerpiece of today’s re:Invent announcements is Nova Forge, a service that Amazon claims lets companies train custom AI models in ways that were previously impossible without spending billions of dollars. The problem Forge addresses is real. Most companies trying to customize AI models face three bad choices: fine-tuning a closed model (but only at the edges), training on an open-weight model (but without the original training data, and at the risk of capability regression, where the model becomes expert on the new data but forgets its original, broader skills), or building a model from scratch at enormous cost.

Forge offers something else: access to Amazon’s Nova model checkpoints at the pre-training, mid-training, and post-training stages. As Prasad put it, companies can inject their proprietary data early in the process, when the model’s “learning ability is highest,” rather than only adjusting the model’s behavior at the end.

“What we’ve done is democratize AI and frontier model development for your use cases at a fraction of the cost [before],” Prasad said. Forge was created because Amazon’s internal teams wanted a tool to inject their domain expertise into the base model without building it from scratch.

“We created Forge because our internal teams wanted Forge,” he said. It’s a familiar Amazon pattern. AWS began as infrastructure built for Amazon’s own retail operations before becoming the company’s profit engine.

Reddit is using Forge to build custom safety models trained on 23 years of community moderation data. “I’ve never seen anything like it,” Chris Slowe, Reddit’s CTO and first employee, told me. “We have a distinguished engineer who works like a kid in a candy store.”

Slowe said Reddit had a continued pre-training job running last week that “looks really promising.” The goal: Replace multiple specialized safety models with a single Reddit-expert model that understands the nuances of community moderation, including the notoriously subjective rule that appears in subreddits everywhere: “Don’t be a jerk.”

“Having an expert model, it’s going to make the community understand,” Slowe said. “That’ll give you a pretty good idea of what a jerk means.”

Slowe explained that Forge lets Reddit control its own models, avoid surprises from API changes, retain ownership of its weights, and avoid sending sensitive data to third-party model providers. He said Reddit is already using the same approach for Reddit Answers and other products.

When I asked Slowe if it mattered that Nova isn’t a top-tier model on the benchmarks, he was clear: “In this context, what matters is the Reddit expertise of the model.” That’s the formula Amazon wants developers to pursue: not raw IQ points, but control and expertise.

With Forge, Amazon is making a calculated bet that the model race has been commoditized and that it can succeed by becoming the place where companies build specialized AI for specific business problems. It’s a very AWS-shaped view of the world: infrastructure over intelligence and optimization over raw capability. This strategy also allows Amazon to avoid direct comparisons with OpenAI and Anthropic, both of which it once hoped to compete with at the model level.

Whether Forge is truly pioneering or just clever positioning depends, of course, on developer adoption. Amazon insists that the model race, as widely understood, does not matter. If that turns out to be true, then the scoreboard turns into something much quieter and harder to game: whether AI models actually provide utility in the real world.
