Alibaba's new open source Qwen3.5 Medium models offer Claude Sonnet 4.5 performance on local computers


Alibaba’s now-famous Qwen AI development team has done it again: a little over a day ago, it released the Qwen3.5 Medium model series, four new large language models (LLMs) with support for agentic tool calling. Three of them are available for commercial use by enterprises and indie developers under the standard open source Apache 2.0 license:

  • Qwen3.5-35B-A3B

  • Qwen3.5-122B-A10B

  • Qwen3.5-27B

Developers can download them now on Hugging Face and ModelScope. The fourth model, Qwen3.5-Flash, appears to be proprietary and is only available through the Alibaba Cloud Model Studio API, but still offers a strong advantage in cost compared to other models in the West (see pricing comparison table below).

But the big twist is that the open source models deliver benchmark performance comparable to similar-sized proprietary models from major US labs such as OpenAI and Anthropic, actually outperforming OpenAI’s GPT-5-Mini and Anthropic’s Claude Sonnet 4.5, the latter released just five months ago.

And the Qwen team says it has engineered these models to remain highly accurate when "quantized," a process that further reduces their footprint by lowering the numerical precision at which the model’s parameters are stored from many possible values to very few.

Crucially, the release brings "frontier-level" context windows to desktop PCs. The flagship Qwen3.5-35B-A3B can now handle context lengths of over 1 million tokens on a consumer-grade GPU with 32GB of VRAM. While that isn’t hardware everyone has access to, it’s far less computationally expensive than many other comparably performing options.

This leap is made possible by near-lossless accuracy under 4-bit weight and KV-cache quantization, allowing developers to process massive datasets without server-grade infrastructure.
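To see why 4-bit quantization matters for local deployment, here is the back-of-the-envelope VRAM math. This is a minimal illustrative sketch, not official sizing guidance; it counts only the weights and ignores KV cache and activation memory.

```python
# Rough VRAM arithmetic: why 4-bit weights put a 35B-parameter model
# within reach of a 32GB consumer GPU. Illustrative estimates only.

def weight_footprint_gb(n_params_billions: float, bits_per_weight: int) -> float:
    """Approximate storage for the model weights alone (no KV cache, no activations)."""
    total_bytes = n_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

fp16 = weight_footprint_gb(35, 16)  # ~70 GB: well beyond consumer hardware
int4 = weight_footprint_gb(35, 4)   # ~17.5 GB: leaves headroom for a quantized KV cache

print(f"FP16: {fp16:.1f} GB, INT4: {int4:.1f} GB")
```

The remaining ~14GB of headroom on a 32GB card is what the quantized KV cache draws on as the context grows toward the million-token mark.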

Technology: Delta Force

At the heart of Qwen3.5’s performance is a sophisticated hybrid architecture. While many models rely solely on standard transformer blocks, Qwen3.5 integrates a gated delta network combined with a sparse mixture-of-experts (MoE) system. Technical specifications for Qwen3.5-35B-A3B reveal a highly efficient design:

  • Parameter efficiency: While the model has 35 billion total parameters, it activates only 3 billion for any given token.

  • Expert diversity: The MoE layer uses 256 experts, routing each token to 8 of them plus 1 always-on shared expert, helping maintain performance while reducing inference latency.

  • Near-lossless quantization: The series maintains high accuracy even when compressed to 4-bit weights, significantly reducing the memory footprint for local deployment.

  • Base model release: To support the research community, Alibaba has open-sourced Qwen3.5-35B-A3B-Base alongside the instruction-tuned versions.
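The routing scheme described above can be sketched in a few lines. This is a toy illustration of top-k expert routing with a shared expert, not the actual Qwen3.5 implementation; the dimensions and single-matrix "experts" are simplified assumptions.

```python
# Toy sketch of sparse MoE routing: 8 routed experts + 1 shared expert per token,
# in the spirit of the design described above. Not the real Qwen3.5 code.
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, D = 256, 8, 16

router_w = rng.standard_normal((D, NUM_EXPERTS))    # router projection
experts = rng.standard_normal((NUM_EXPERTS, D, D))  # per-expert FFN (toy: one matrix each)
shared = rng.standard_normal((D, D))                # shared expert, always active

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts plus the shared expert."""
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]               # indices of the 8 highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                            # softmax over the selected experts only
    routed = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return routed + x @ shared                      # shared expert added unconditionally

y = moe_forward(rng.standard_normal(D))
print(y.shape)  # only 9 of the 256 experts are touched for this token
```

The efficiency win is visible in the loop: although 256 expert weight matrices exist, each token pays the compute cost of just 9 of them.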

Product: Models That ‘Think’ First

Qwen3.5 introduces a native "thinking mode" as its default state. Before providing a final answer, the model generates an internal reasoning chain, delimited by <think> tags, to work through complex logic. The product lineup is designed for different hardware environments:

  • Qwen3.5-27B: Optimized for high efficiency, supporting context lengths of over 800K tokens.

  • Qwen3.5-Flash: The production-grade hosted version, with a default 1-million-token context length and built-in official tools.

  • Qwen3.5-122B-A10B: Designed for server-grade GPUs (80GB VRAM), this model supports 1M+ context lengths, bridging the gap with the world’s largest frontier models.
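Applications consuming thinking-mode output typically strip the reasoning before showing the reply to users. Here is a minimal sketch, assuming the model emits its chain of thought inside <think>…</think> tags ahead of the visible answer (the tag format is per the article; exact model output may vary).

```python
# Minimal sketch: separate the internal reasoning chain from the final answer,
# assuming <think>...</think> precedes the visible reply. Output format may vary.
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block is present."""
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not m:
        return "", raw.strip()
    reasoning = m.group(1).strip()
    answer = raw[m.end():].strip()
    return reasoning, answer

raw = "<think>2 apples + 3 apples = 5 apples</think>There are 5 apples."
reasoning, answer = split_thinking(raw)
print(answer)  # There are 5 apples.
```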

Benchmark results validate this architectural bet. The 35B-A3B model notably outperforms larger predecessors such as Qwen3-235B, as well as the aforementioned proprietary GPT-5-Mini and Claude Sonnet 4.5, in categories including multilingual knowledge (MMMLU) and visual reasoning (MMMU-Pro).

Pricing and API Integration

For those not hosting their own weights, Alibaba Cloud Model Studio offers a competitively priced API for Qwen3.5-Flash:

  • Input: $0.10 per 1M tokens

  • Output: $0.40 per 1M tokens

  • Cache write: $0.125 per 1M tokens

  • Cache read: $0.01 per 1M tokens

The API also includes a granular tool calling pricing model, with web searches priced at $10 per 1,000 calls, and the code interpreter currently offered for a limited time at no cost.
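Putting those per-million-token rates together, the cost of a single call is a simple weighted sum. A back-of-the-envelope sketch, using the prices listed above (actual billing, tiers, and discounts may differ):

```python
# Estimate the cost of one Qwen3.5-Flash API call from the listed rates.
# Prices copied from the article; real-world billing may differ.

RATES = {                 # USD per 1M tokens
    "input": 0.10,
    "output": 0.40,
    "cache_write": 0.125,
    "cache_read": 0.01,
}

def request_cost(tokens: dict[str, int]) -> float:
    """Sum the cost across token categories for a single API call."""
    return sum(RATES[kind] * count / 1_000_000 for kind, count in tokens.items())

# Example: a long-context call with 800K input tokens and 2K output tokens
cost = request_cost({"input": 800_000, "output": 2_000})
print(f"${cost:.4f}")  # $0.0808
```

At these rates, even near-context-limit requests cost well under a dime, which is what drives the totals in the comparison table below.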

This makes Qwen3.5-Flash the most economical to run via API among all the major LLMs in the world. Check out the comparison table below:

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Total cost | Source |
|---|---|---|---|---|
| Qwen3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| Qwen3.5-Flash | $0.10 | $0.40 | $0.50 | Alibaba Cloud |
| DeepSeek-Chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| DeepSeek-Reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| MiniMax M2.5 | $0.15 | $1.20 | $1.35 | MiniMax |
| MiniMax M2.5-Lightning | $0.30 | $2.40 | $2.70 | MiniMax |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google |
| Kimi K2.5 | $0.60 | $3.00 | $3.60 | Moonshot AI |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| Ernie 5.0 | $0.85 | $3.40 | $4.25 | Baidu |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
| Qwen3-Max (2026-01-23) | $1.20 | $6.00 | $7.20 | Alibaba Cloud |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |

What this means for enterprise technology leaders and decision makers

With the launch of the Qwen3.5 Medium models, the rapid iteration and fine-tuning once reserved for well-funded labs is now accessible for on-premise development at many non-tech firms, effectively decoupling sophisticated AI from large-scale capital expenditure.

Across the organization, this architecture transforms the way data is managed and secured. The ability to ingest large-scale document stores or hours-long video locally allows for deep institutional analysis without the privacy risks of third-party APIs.

Specifically, by deploying these mixture-of-experts models behind a private firewall, organizations can maintain sovereign control over their data while using the native "thinking" mode and official tool-calling capabilities to build more reliable autonomous agents.

Early adopters on Hugging Face have particularly praised the models’ ability to "close the gap" in agentic scenarios where previously only the largest closed models could compete.

This shift toward architectural efficiency over raw scale ensures that AI integration remains cost-conscious, secure, and agile enough to keep pace with evolving operational requirements.


