Alibaba's small, open source Qwen3.5-9B beats OpenAI's gpt-oss-120B and can run on standard laptops

Despite political turmoil in the US AI field, progress in AI in China continues apace.

Earlier today, the Qwen team of AI researchers at e-commerce giant Alibaba, the group behind the powerful and growing Qwen family of open source language and multimodal models, unveiled its latest batch, the Qwen3.5 small model series, which includes:

  • Qwen3.5-0.8B and 2B: Two models optimized to be small and fast, intended for demonstration, prototyping, and deployment on edge devices where battery life is paramount.

  • Qwen3.5-4B: A robust multimodal base for lightweight agents, natively supporting a 262,144-token context window.

  • Qwen3.5-9B: A compact reasoning model that outperforms US rival OpenAI's open source gpt-oss-120B, a model more than 13 times its size, on key third-party benchmarks, including multilingual knowledge and graduate-level reasoning.

To put this in perspective, these models are among the smallest general-purpose models recently released by any lab worldwide, comparable to MIT offshoot Liquid AI's LFM2 series, with several hundred million to a few billion parameters, compared to the estimated trillions of parameters (model settings) reportedly used in the flagship models from OpenAI, Anthropic, and Google's Gemini series.

Weights for the models are available now globally under the Apache 2.0 license – suited for enterprise and commercial use, including customization as needed – on Hugging Face and ModelScope.

Technology: hybrid efficiency and native multimodality

The technical foundation of the Qwen3.5 small series differs from the standard transformer architecture. Alibaba has moved toward an efficient hybrid architecture that combines gated delta networks (a form of linear attention) with a sparse mixture-of-experts (MoE).

This hybrid approach addresses the "memory wall" that typically limits smaller models: by using gated delta networks, the models achieve high throughput and significantly lower latency during inference.
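To see why linear attention helps with the "memory wall," consider the memory each approach holds during inference. The following is a minimal back-of-the-envelope sketch, not Qwen's actual configuration: the layer count, head count, and head dimension are hypothetical values chosen for illustration.

```python
# Illustrative sketch (hypothetical dimensions, not Qwen3.5's real config):
# standard attention's KV cache grows with sequence length, while a
# gated-delta-network-style recurrent state stays constant in size.

def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Standard attention: cached keys + values per layer grow linearly
    with the number of tokens processed."""
    return seq_len * layers * kv_heads * head_dim * 2 * dtype_bytes

def linear_state_bytes(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Linear attention: a fixed-size (head_dim x head_dim) state per head,
    independent of sequence length."""
    return layers * kv_heads * head_dim * head_dim * dtype_bytes

for n in (4_096, 262_144):
    print(f"{n:>8} tokens: KV cache ≈ {kv_cache_bytes(n) / 1e9:.2f} GB, "
          f"linear state ≈ {linear_state_bytes() / 1e9:.2f} GB")
```

Under these assumed dimensions, the KV cache at a 262,144-token context would run into the tens of gigabytes, while the recurrent state stays in the tens of megabytes regardless of context length, which is the core of the latency and throughput argument.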

Furthermore, these models are natively multimodal. Unlike previous generations, which "bolted on" a vision encoder to text models, Qwen3.5 was trained with early fusion on multimodal tokens. This allows the 4B and 9B models to demonstrate a level of visual understanding – such as reading UI elements or counting objects in a video – that previously required models ten times their size.

Benchmarking the "small" series: performance that defies scale

Newly released benchmark data shows just how aggressively these compact models are competing with – and often surpassing – larger industry standards. The Qwen3.5-9B and Qwen3.5-4B variants demonstrate cross-generational leaps in efficiency, particularly in multimodal and logic tasks.

Multimodal Dominance: In the MMMU-Pro visual reasoning benchmark, the Qwen3.5-9B achieved a score of 70.1, outperforming the Gemini 2.5 Flash-Lite (59.7) and even the specialized Qwen3-VL-30B-A3B (63.0).

Graduate-level reasoning: On the GPQA Diamond benchmark, the 9B model reached a score of 81.7, surpassing gpt-oss-120B (80.1), a model with more than ten times the parameter count.

Video Understanding: The series shows exceptional performance in video reasoning. On the Video-MME (with subtitles) benchmark, the Qwen3.5-9B scored 84.5 and the 4B scored 83.5, well ahead of the Gemini 2.5 Flash-Lite (74.6).

Mathematical Skills: In the HMMT February 2025 (Harvard-MIT Mathematics Tournament) assessment, the 9B model scored 83.2, while the 4B variant scored 74.0, proving that high-level STEM reasoning no longer requires massive computation clusters.

Documents and Multilingual Knowledge: The 9B variant leads in document recognition on OmniDocBench v1.5 with a score of 87.7. Meanwhile, it maintains a top-tier multilingual showing on MMMLU with a score of 81.2, outperforming gpt-oss-120B (78.2).

Community Reactions: "More intelligence, less compute"

Hot on the heels of last week's release of the already significantly smaller yet powerful open source Qwen3.5-Medium, which can run on a single GPU, the announcement of the Qwen3.5 small model series, with an even smaller footprint and lower processing requirements, has sparked immediate interest among developers focused on "local-first" AI.

The tagline "More intelligence, less compute" resonated with users wanting alternatives to cloud-based models.

Blueshell AI's AI and tech educator Paul Couvert captured the industry's shock at this efficiency leap.

"How is this even possible?!" Couvert wrote on X. "Qwen has released 4 new models and the 4B version is almost as capable as the previous 80B A3B. And the 9B is just as good as GPT OSS 120B while being 13 times smaller!"

Couvert's analysis highlights the practical implications of these architectural benefits:

  • "These can be run on any laptop"

  • "0.8B and 2B for your phone"

  • "offline and open source"

As developer Karan Kendre of Kargul Studio put it: "this model [can run] locally for free on my M1 MacBook Air."

This sense of newfound accessibility is echoed throughout the developer ecosystem. One user noted that the 4B model, as a "strong multimodal base," is a "game changer for mobile developers" who need screen-reading capabilities without high CPU overhead.

Indeed, Hugging Face developer Xenova notes that the new Qwen3.5 small model series can even run directly in the user's web browser, performing sophisticated, previously compute-intensive operations such as video analysis.

Researchers also praised the release of the base model alongside the instruct editions, noting that it provides the necessary foundation for "real-world industrial innovation."

The release of the base model is particularly valued by enterprise and research teams because it provides a "blank slate" that is not biased by a specific set of RLHF (reinforcement learning from human feedback) or SFT (supervised fine-tuning) data, which can otherwise bake in refusals or particular conversation styles that are hard to undo.

Now, with the base model available, teams interested in customizing the model for specific tasks have an easier starting point: they can apply their own post-training and instruction tuning without first having to work around Alibaba's.

Licensing: A win for the open ecosystem

Alibaba has released the weights and configuration files for the Qwen3.5 series under the Apache 2.0 license. This permissive license permits commercial use, modification, and distribution without royalty payments, avoiding the "vendor lock-in" associated with proprietary APIs.

  • Commercial Use: Developers can integrate the models into commercial products royalty-free.

  • Modification: Teams can fine-tune (SFT) or apply RLHF to create specialized versions.

  • Distribution: The models can be redistributed in local-first AI applications such as Ollama.

Why small models matter so much right now

The release of the Qwen3.5 small series arrives amid an industry-wide "agentic realignment." We have moved beyond simple chatbots; now the goal is autonomy. An autonomous agent must "think" (reason), "see" (multimodality), and "act" (use tools). While doing this with trillion-parameter models is prohibitively expensive, a local Qwen3.5-9B can execute these loops for a fraction of the cost.

By scaling reinforcement learning (RL) to million-agent environments, Alibaba has tuned these small models toward "human-aligned decisions," allowing them to handle multi-step tasks like organizing desktops or reverse-engineering gameplay footage into code. Whether it is a 0.8B model running on a smartphone or a 9B model powering a coding terminal, the Qwen3.5 series is effectively democratizing the "agentic era."

The shift the Qwen3.5 series represents, from "chatbots" to "native multimodal agents," changes how enterprises can deliver intelligence. By pushing sophisticated reasoning to the "edge" – personal devices and local servers – organizations can automate tasks that previously required expensive cloud APIs or high-latency processing.

Strategic Enterprise Applications and Ideas

The 0.8B to 9B models have been engineered for efficiency, using a hybrid architecture that activates only the essential parts of the network for each task.

  • Visual Workflow Automation: Using "pixel-level grounding," these models can navigate desktop or mobile UIs, fill out forms, and organize files based on natural language instructions.

  • Complex Document Parsing: With scores over 90% on document understanding benchmarks, they can replace separate OCR and layout parsing pipelines to extract structured data from diverse forms and charts.

  • Autonomous Coding and Refactoring: Enterprises can feed an entire repository (up to 400,000 lines of code) into the 1M-token context window for production-ready refactors or automated debugging.

  • Real-Time Edge Analysis: The 0.8B and 2B models are designed for mobile devices, enabling offline video summarization (up to 60 seconds at 8 fps) and spatial reasoning without straining battery life.
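The stated 60-second, 8 fps edge limit translates into a concrete frame budget an on-device model must ingest. A quick arithmetic sketch (the per-frame token count here is a hypothetical figure for illustration, not a published spec):

```python
# Frame budget at the stated edge video limit; the visual-token figure
# is a hypothetical assumption, not a Qwen3.5 specification.

def video_frame_budget(duration_s: int = 60, fps: int = 8) -> int:
    """Number of frames ingested for a clip at the given sampling rate."""
    return duration_s * fps

frames = video_frame_budget()        # 60 s at 8 fps
tokens_per_frame = 64                # hypothetical visual tokens per frame
visual_tokens = frames * tokens_per_frame
print(f"{frames} frames, ~{visual_tokens} visual tokens")
```

Even under these modest assumptions, a one-minute clip produces hundreds of frames, which is why sub-10B parameter counts and efficient attention matter for battery-constrained devices.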

The table below outlines which enterprise functions can benefit most from local, small-model deployments.

| Function | Primary benefit | Key use case |
| --- | --- | --- |
| Software engineering | Local code intelligence | Repository-wide refactoring and terminal-based agentic coding |
| Operations & IT | Secure automation | Automating multi-step system configuration and file management tasks locally |
| Product & UX | Edge interaction | Integrating native multimodal reasoning directly into mobile/desktop apps |
| Data analysis | Efficient extraction | High-fidelity OCR and structured data extraction from complex visual reports |

Although these models are highly capable, their small size and "agentic" nature introduce specific operational risks that teams should monitor.

  • Hallucination Cascade: In multi-step "agentic" workflows, a small error in an early stage can cascade into failures where the agent commits to a wrong or redundant plan.

  • Debugging vs. Greenfield Coding: While these models excel at writing new "greenfield" code, they may struggle with debugging or modifying existing, complex legacy systems.

  • Memory and VRAM Demands: Even "small" models (such as the 9B) require significant VRAM for high-throughput inference; the memory footprint remains high because the total parameter count still occupies GPU memory.

  • Regulatory and Data Residency: Using a China-based provider's model may raise data residency questions in some jurisdictions, although the Apache 2.0 open source release allows hosting in any local or sovereign cloud.
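The VRAM concern above can be sanity-checked with simple arithmetic. The sketch below estimates raw weight memory for each model size at common precisions; these are back-of-the-envelope figures for weights only (activations, the KV or recurrent state, and runtime overhead add more) and are not vendor specifications.

```python
# Rough weight-only memory estimates per precision; illustrative
# arithmetic, not official Qwen3.5 hardware requirements.

PARAMS = {"0.8B": 0.8e9, "2B": 2e9, "4B": 4e9, "9B": 9e9}
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for name, n in PARAMS.items():
    row = ", ".join(
        f"{prec}: {n * b / 2**30:.1f} GiB" for prec, b in BYTES_PER_PARAM.items()
    )
    print(f"Qwen3.5-{name}: {row}")
```

Under these assumptions the 9B model needs roughly 17 GiB of weight memory at fp16 but only about 4 GiB at 4-bit quantization, which is what puts it in reach of consumer laptops while still being a real resource commitment.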

Enterprises should prioritize verifiable tasks – such as coding, math, or instruction following – where output can be automatically checked against predefined rules to prevent "reward hacking" or silent failures.


