Gemini 3 Pro scores 69% trust in blinded testing, up from 16% for Gemini 2.5: The case for evaluating AI on real-world trust, not academic benchmarks


Just a few weeks ago, Google launched Gemini 3, claiming leadership positions on several AI benchmarks. But the challenge with vendor-provided benchmarks is that they are just that: vendor-provided.

A new vendor-neutral assessment from Prolific, however, also puts Gemini 3 at the top of the leaderboard. It is not based on a set of academic benchmarks; rather, it is based on real-world characteristics that actual users and organizations care about.

Prolific was founded by researchers at the University of Oxford and provides high-quality, trusted human data to power rigorous research and ethical AI development. The company's HUMAINE benchmark applies that approach, using representative human samples and blind testing to rigorously compare AI models across different user scenarios, measuring not only technical performance but also user trust, adaptability and communication style.

The latest HUMAINE trial enlisted 26,000 users to blind-test the models. In the evaluation, Gemini 3 Pro's trust score jumped from 16% to 69%, the highest Prolific has ever recorded. Gemini 3 now ranks number one in trust, ethics and safety across demographic subgroups 69% of the time; its predecessor, Gemini 2.5 Pro, held the top spot only 16% of the time.

Overall, Gemini 3 ranks first in three of the four evaluation categories: performance and reasoning, conversation and adaptability, and trust, ethics and safety. It lost only on communication style, where DeepSeek v3 came out on top with 43% preference. HUMAINE testing also showed that Gemini 3 performed consistently well across 22 different demographic user groups, spanning age, gender, ethnicity and political orientation. The evaluation also found that users are now five times more likely to choose a model in a head-to-head comparison.

But the rankings matter less than why it won.

"It has consistency across a wide range of different use cases, and has a personality and a style that appeals to a wide variety of users," Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. "Although in some specific instances, other models are preferred either by smaller subgroups or on a particular conversation type, it is the breadth of knowledge and the flexibility of the model across different use cases and audience types that allowed it to win this particular benchmark."

How blind testing reveals what academic benchmarks miss

HUMAINE’s methodology highlights shortcomings in the way the industry evaluates models. Users interact with two models simultaneously in multi-turn conversations. They don’t know which vendors power each response. They discuss all the topics that matter to them, not predetermined test questions.

The sampling is what counts. HUMAINE uses representative samples of the US and UK populations, controlling for age, gender, ethnicity and political orientation. This reveals something static benchmarks cannot capture: model performance varies by audience.
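To make that protocol concrete, here is a rough Python sketch of a blind, pairwise, multi-turn comparison; the model names, helper functions and record format are hypothetical illustrations, not Prolific's actual tooling:

```python
import random

# Hypothetical model pool; raters only ever see "Left" and "Right", never vendor names.
MODELS = ["model_a", "model_b", "model_c"]

def blind_trial(user_turns, demographics, get_response, get_preference):
    """Run one blind comparison: the user drives a multi-turn conversation,
    both anonymized models answer every turn, and the user picks a side."""
    left, right = random.sample(MODELS, 2)  # random, hidden pairing
    transcript = []
    for turn in user_turns:  # topics come from the user, not a fixed test set
        transcript.append({
            "user": turn,
            "Left": get_response(left, turn),
            "Right": get_response(right, turn),
        })
    choice = get_preference(transcript)  # "Left", "Right" or "Tie", judged blind
    winner = {"Left": left, "Right": right}.get(choice, "tie")
    return {"winner": winner, "pair": (left, right), "demographics": demographics}
```

Aggregating those per-trial records by demographic subgroup, rather than pooling everything into a single score, is what allows the leaderboard to shift when the audience changes.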

"If you take AI leaderboards, most of them may still have a fairly stable list," Bradley said. "But for us, if you control for audience, we end up with a slightly different leaderboard, whether you’re looking at a left-leaning sample, a right-leaning sample, US, UK. And I think age was actually the most variable reported in our experiment."

For enterprises deploying AI across diverse employee populations, this matters. A model that performs well for one demographic may perform poorly for another.

This methodology also addresses a fundamental question in AI evaluation: why use human judges when AI can evaluate itself? Bradley said his company uses AI judges in some use cases, though he stressed that human evaluation is still a key factor.

"We see the greatest benefits from smart orchestration of both LLM judge and human data, both have strengths and weaknesses that, when combined smartly, perform better together," Bradley said. "But we still think human data is where the alpha is. We are still very confident that human data and human intelligence need to be in the loop."

What does trust mean in AI evaluation?

The trust, ethics and safety category measures user confidence in a model's reliability, factual accuracy and responsible behavior. In HUMAINE's methodology, trust is not a vendor claim or a technical metric; it is what users report after blind interactions with competing models.

The 69% figure reflects how often Gemini 3 ranked first across demographic subgroups, not a single pooled score. That consistency matters more than the headline number because organizations serve diverse populations.

"Had no awareness that they were using Gemini in this scenario," Bradley said. "It was based only on blind multi-turn reaction."

This differentiates perceived trust from earned trust. Users evaluated model outputs without knowing which vendor produced them, eliminating Google's brand advantage. For customer-facing deployments where the AI vendor remains invisible to end users, this difference matters.

What should enterprises do now

One of the key things enterprises should do now when weighing different models is to adopt an evaluation framework that actually works.

"It is becoming challenging to evaluate models based exclusively on vibes," Bradley said. "I think we need a more rigorous, scientific approach to truly understand how these models are performing."

The HUMAINE data suggests a framework: test for consistency across use cases and user demographics, not just peak performance on specific tasks. Use blind testing to separate model quality from brand perception. Use representative samples that match your actual user population. And plan for continuous evaluation as models change.
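As a rough sketch of that consistency check, assuming each blind trial is logged in the hypothetical record format from the earlier sketch (not HUMAINE's actual pipeline), the question is how often a model tops the ranking within each subgroup rather than in the pooled average:

```python
from collections import defaultdict

def subgroup_win_rates(trials):
    """trials: records like {"winner": "model_a", "demographics": {"age": "25-34", ...}}.
    Returns each model's share of blind wins within every demographic subgroup."""
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for t in trials:
        for attribute, value in t["demographics"].items():
            group = (attribute, value)
            totals[group] += 1
            if t["winner"] != "tie":
                wins[group][t["winner"]] += 1
    return {g: {m: w / totals[g] for m, w in models.items()}
            for g, models in wins.items()}

def top_rank_share(win_rates, model):
    """Fraction of subgroups in which `model` has the highest win rate:
    the shape of a claim like 'ranks first across subgroups 69% of the time'."""
    tops = [max(rates, key=rates.get) == model for rates in win_rates.values()]
    return sum(tops) / len(tops) if tops else 0.0
```

A model can look strong in the pooled numbers while losing badly in particular subgroups; running the comparison this way surfaces that gap before deployment rather than after.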

For enterprises looking to deploy AI at scale, this means moving from "which model is best" to "which model is best for our specific use case, user demographics and required features."

The rigor of representative sampling and blind testing provides the data to make that determination, something neither technical benchmarks nor vibes-based evaluations can offer.


