Qwen3-Max Thinking beats Gemini 3 Pro and GPT-5.2 on Humanity's Last Exam (with search)

Carl Franzen
Chinese AI and tech companies continue to impress with the development of cutting-edge AI language models.

Today, Alibaba Cloud's Qwen team of AI researchers is attracting attention with the unveiling of its new proprietary reasoning language model, Qwen3-Max-Thinking.

You may recall, as VentureBeat covered last year, that Qwen has made a name for itself in the fast-moving global AI marketplace by offering a variety of powerful, open source models across a range of modalities, from text to images to spoken audio. The company also picked up support from US lodging tech giant Airbnb, whose CEO and co-founder Brian Chesky said the company was banking on Qwen's free, open source models as a more affordable alternative to US offerings like OpenAI's.

Now, with the proprietary Qwen3-Max-Thinking, the Qwen team aims to match, and in some cases surpass, the reasoning capabilities of GPT-5.2 and Gemini 3 Pro through architectural efficiency and agentic autonomy.

The release comes at a critical juncture. Western labs have largely defined this "reasoning" category (often dubbed "System 2" thinking), but Qwen's latest benchmarks suggest the gap has closed.

Additionally, the company’s relatively affordable API pricing strategy aggressively targets enterprise adoption. However, since it is a Chinese model, some US companies with strict national security requirements and considerations may be wary of adopting it.

Architecture: "test-time scaling" redefined

The main innovation driving Qwen3-Max-Thinking is its departure from standard inference methods. While most models generate tokens linearly, Qwen3 uses a "heavy mode" powered by a technique called test-time scaling.

In simple terms, this technique allows models to trade computation for intelligence. But unlike naive "best-of-n" sampling – where a model might generate 100 answers and pick the best – Qwen3-Max-Thinking employs an experience-accumulating, multi-round strategy.

This approach mimics human problem-solving. When the model faces a complex query, it doesn't just make guesses; it engages in iterative self-reflection, using a proprietary mechanism to carry insights forward from previous reasoning rounds. This allows the model to:

  1. Identify dead ends: Recognize when a line of reasoning is failing without discarding everything learned along the way.

  2. Focus computation: Redirect processing power toward "unresolved uncertainties" instead of re-deriving known findings.

The efficiency gains are tangible. By avoiding redundant reasoning, the model consolidates rich historical context into a single context window. The Qwen team reports that this approach has delivered large performance increases without exploding token costs:

  • GPQA (PhD level science): The score improved from 90.3 to 92.8.

  • LiveCodeBench v6: Performance increased from 88.0 to 91.4.
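Qwen has not published the internals of this mechanism, but the contrast with best-of-n can be illustrated with a toy search problem. In the sketch below (purely illustrative, not Qwen's actual algorithm), `best_of_n` ranks independent guesses, while `multi_round` locks in what earlier rounds resolved and spends new effort only on the remaining uncertainty:

```python
# Toy task: recover a hidden 4-digit code. score() = digits correct in place.
SECRET = "7312"

def score(guess: str) -> int:
    """Number of positions where the guess matches the hidden code."""
    return sum(a == b for a, b in zip(guess, SECRET))

def best_of_n(candidates):
    """Naive best-of-n: rank fully independent guesses, keep the top scorer."""
    return max(candidates, key=score)

def multi_round(start: str, digits: str = "0123456789") -> str:
    """Experience-accumulating search: keep what previous rounds resolved
    and spend new effort only on positions still in doubt."""
    guess = list(start)
    for pos in range(len(guess)):           # one "round" per unresolved position
        for d in digits:
            trial = guess.copy()
            trial[pos] = d
            if score("".join(trial)) > score("".join(guess)):
                guess = trial               # insight retained for later rounds
                break
    return "".join(guess)
```

The naive strategy must get lucky with a whole answer at once; the multi-round strategy converges with far fewer total evaluations, which is the intuition behind trading structured computation for intelligence.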

Beyond pure reasoning: adaptive tool use

While "thinking" models are powerful, they have historically been siloed – great at math, but weak at browsing the web or running code. Qwen3-Max-Thinking bridges this gap by integrating its "thinking" and "non-thinking" modes.

The model has adaptive tool-use capabilities, meaning it automatically selects the right tool for the task without manual user prompting. It can toggle seamlessly between:

  • Web Search and Extraction: For real-time factual queries.

  • Memory: To store and recall user-specific context.

  • Code Interpreter: Writing and executing Python snippets for computational tasks.

In "thinking mode," the model can use these tools simultaneously. This capability matters for enterprise applications where a model may need to verify a fact (search), calculate a projection (code interpreter), and then reason about strategic implications (thinking) all at once.

Empirically, the team noted that this combination "effectively reduces hallucinations," because the model can ground its reasoning in verifiable external data instead of relying only on its training weights.
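The article does not specify Alibaba Cloud's exact tool schema, but under the widely used OpenAI-style `tools` convention, wiring up this kind of adaptive tool selection looks roughly like the sketch below. The tool names, parameter schemas, and `tool_choice` flag here are illustrative assumptions, not Qwen documentation:

```python
# Illustrative request payload following the generic OpenAI-style "tools"
# convention; Alibaba Cloud's actual field names may differ.
def build_agentic_request(question: str) -> dict:
    return {
        "model": "qwen3-max-2026-01-23",
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {"type": "function", "function": {
                "name": "web_search",                 # hypothetical tool name
                "description": "Fetch real-time facts from the web.",
                "parameters": {"type": "object",
                               "properties": {"query": {"type": "string"}},
                               "required": ["query"]}}},
            {"type": "function", "function": {
                "name": "code_interpreter",           # hypothetical tool name
                "description": "Run a Python snippet and return its output.",
                "parameters": {"type": "object",
                               "properties": {"code": {"type": "string"}},
                               "required": ["code"]}}},
        ],
        "tool_choice": "auto",  # let the model pick the right tool itself
    }

payload = build_agentic_request("What is the current EUR/USD rate, squared?")
```

The key design point is `"tool_choice": "auto"`: the caller declares what is available, and the model decides per step whether to search, execute code, or keep reasoning.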

Benchmark Analysis: The Data Story

Qwen isn't shy about making direct comparisons.

On HMMT (February 2025), a rigorous competition-math benchmark, Qwen3-Max-Thinking scored 98.0, beating Gemini 3 Pro (97.5) and finishing well ahead of DeepSeek v3.2 (92.5).

However, the most important signal for developers is arguably agentic search. On Humanity's Last Exam (HLE) – a benchmark measuring performance on 3,000 "Google-proof" expert-level questions across mathematics, science, computer science, the humanities, and engineering – Qwen3-Max-Thinking, equipped with its web search tool, scored 49.8, beating both Gemini 3 Pro (45.8) and GPT-5.2-Thinking (45.5).

This suggests that the architecture of Qwen3-Max-Thinking is well suited to complex, multi-step agentic workflows where external data retrieval is necessary.

The model shines in coding tasks too. On Arena-Hard v2, it posted a score of 90.2, leaving competitors like Claude Opus 4.5 (76.7) far behind.

The economics of reasoning: pricing breakdown

For the first time, we have a clear look at the economics of Qwen's top-tier reasoning model. Alibaba Cloud has positioned qwen3-max-2026-01-23 as a premium but accessible offering on its API.

  • Input: $1.20 per 1 million tokens (for standard requests of ≤32K tokens).

  • Output: $6.00 per 1 million tokens.
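At those rates, a quick back-of-envelope helper makes the bill concrete (the per-token rates are from the list above; the token counts in the example are hypothetical):

```python
def qwen3_max_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD at the published rates: $1.20/M input, $6.00/M output."""
    return input_tokens / 1e6 * 1.20 + output_tokens / 1e6 * 6.00

# A reasoning-heavy call: 10K tokens in, 30K "thinking + answer" tokens out.
cost = qwen3_max_cost(10_000, 30_000)
print(f"${cost:.3f}")  # output tokens dominate for thinking models
```

Note that for reasoning models the hidden "thinking" tokens bill as output, so output pricing is usually what dominates the invoice.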

At a base level, here’s how Qwen3-Max-Thinking stacks up:

| Model | Input (/1M) | Output (/1M) | Total cost | Source |
|---|---|---|---|---|
| Qwen3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| DeepSeek-Chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| DeepSeek-Reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| Qwen3 Plus | $0.40 | $1.20 | $1.60 | Alibaba Cloud |
| Ernie 5.0 | $0.85 | $3.40 | $4.25 | Qianfan |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
| Qwen3-Max-Thinking (2026-01-23) | $1.20 | $6.00 | $7.20 | Alibaba Cloud |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.5 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |

This pricing structure is aggressive, undercutting many older flagship models while offering cutting-edge performance.

However, developers should pay close attention to the itemized pricing for the new agentic capabilities, as Qwen separates the cost of "thinking" (tokens) from the cost of "doing" (tool use).

  • Agentic search: Both the standard search_strategy:agent and the more advanced search_strategy:agent_max are priced at $10 per 1,000 calls.

    • Note: The agent_max strategy is currently marked as a "limited-time offer," suggesting its price may increase later.

  • Web Search: Priced at $10 per 1,000 calls through the Response API.

Promotional free tier: To encourage adoption of its most advanced features, Alibaba Cloud is currently offering two key tools free for a limited time:

  • Web Extractor: Free (limited time).

  • Code Interpreter: Free (limited time).

This pricing model (low token cost + a la carte tool pricing) allows developers to create complex agents that are cost-effective for text processing, while paying a premium only when external actions – such as live web search – are explicitly triggered.
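Putting the token and per-call rates together, a rough cost model for a single agent run might look like the sketch below (rates as listed above; the example workload numbers are hypothetical):

```python
def agent_run_cost(input_tokens: int, output_tokens: int,
                   search_calls: int = 0,
                   web_extractor_calls: int = 0,
                   code_interpreter_calls: int = 0) -> float:
    """USD cost of one agent run under the 'tokens + a la carte tools' model:
    $1.20/M input, $6.00/M output, $10 per 1,000 agentic-search calls; the
    web extractor and code interpreter are free during the current promotion."""
    tokens = input_tokens / 1e6 * 1.20 + output_tokens / 1e6 * 6.00
    tools = search_calls * (10 / 1000)        # $0.01 per search call
    tools += web_extractor_calls * 0.0        # free (limited time)
    tools += code_interpreter_calls * 0.0     # free (limited time)
    return tokens + tools

# Hypothetical research agent: 50K tokens in, 100K out, 5 live searches.
run = agent_run_cost(50_000, 100_000, search_calls=5)
```

For a workload like this, tokens still dominate; the per-call tool fees only become material for agents that fire hundreds of searches per task.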

Developer ecosystem

Recognizing that performance is useless without integration, Alibaba Cloud has made sure that Qwen3-Max-Thinking is drop-in ready.

  • OpenAI compatibility: The API supports the standard OpenAI format, allowing teams to switch by simply changing the base_url and model name.

  • Anthropic compatibility: In a savvy move to capture the coding market, the API also supports the Anthropic protocol, making Qwen3-Max-Thinking compatible with Claude Code, a popular agentic coding environment.
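As a sketch of how small that switch is, the snippet below builds a standard chat-completions request using only Python's standard library. The endpoint shown is Alibaba Cloud's publicly documented international "compatible mode" address at the time of writing; verify the URL and model name for your own account and region before relying on them:

```python
import json
import urllib.request

# Assumed endpoint: Alibaba Cloud's OpenAI-compatible gateway (check your region).
BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"

def make_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but don't send) an OpenAI-format chat request against Qwen.
    Only the base URL and model string differ from a stock OpenAI call."""
    body = {
        "model": "qwen3-max-2026-01-23",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = make_request("sk-...", "Summarize test-time scaling in one sentence.")
```

An existing OpenAI SDK client works the same way: point its base_url at the compatible endpoint and change only the model string.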

Verdict

Qwen3-Max-Thinking represents the maturing of the AI market in 2026. It moves the conversation from "who has the smartest chatbot" to "who has the most capable agent."

By combining high-efficiency reasoning with adaptive, autonomous tool use – and pricing it to move – Qwen has firmly established itself as a top-tier contender for the enterprise AI throne.

For developers and enterprises, the "limited-time free" window suggests that now is the time to experiment with the code interpreter and web extractor. The reasoning wars are far from over, but Qwen has just deployed a very heavy hitter.


