Z.ai debuts open source GLM-4.6V, a native tool-calling vision model for multimodal reasoning

Chinese AI startup Zhipu AI, also known as Z.ai, has released its GLM-4.6V series: a new generation of open-source vision-language models (VLMs) optimized for multimodal reasoning, frontend automation, and high-efficiency deployment.

The release includes two models, one large and one small:

  1. GLM-4.6V (106B): a 106-billion-parameter model aimed at cloud-scale inference

  2. GLM-4.6V-Flash (9B): a 9-billion-parameter model designed for low-latency, local applications

Recall that, generally speaking, models with more parameters – the internal weights and biases that control their behavior – are more capable and perform at a higher level across a wider range of tasks.

However, smaller models may provide better efficiency for edge or real-time applications where latency and resource constraints are critical.

The series' defining innovation is native function calling in a vision-language model – enabling direct use of tools such as search, image cropping, or chart recognition on visual input.

With a 128,000-token context length (roughly the text of a 300-page novel in a single input/output exchange) and state-of-the-art (SOTA) results on over 20 benchmarks, the GLM-4.6V series positions itself as a highly competitive alternative to both closed and open-source VLMs. It is available in the following formats:

  • API access via an OpenAI-compatible interface

  • A demo on Zhipu’s web interface

  • Weights for download on Hugging Face

  • A desktop assistant app on Hugging Face Spaces
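Because the interface is OpenAI-compatible, standard chat-completions tooling should work with image inputs. Below is a minimal sketch of the request shape; the `glm-4.6v` model identifier and the stub image payload are assumptions for illustration – consult Z.ai's API documentation for the exact model name and base URL.

```python
import base64
import json

def image_to_data_url(path: str) -> str:
    """Encode a local image as a data URL usable in an OpenAI-style payload."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:image/png;base64,{b64}"

def build_vision_request(model: str, prompt: str, image_url: str) -> dict:
    """OpenAI-compatible chat payload mixing text and image content parts."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

# Model identifier is an assumption; check Z.ai's API docs for the exact name.
req = build_vision_request(
    "glm-4.6v",
    "Describe the chart in this screenshot.",
    "data:image/png;base64,iVBORw0KGgo=",  # stub standing in for a real image
)
print(json.dumps(req)[:72])
```

The same payload shape works whether the image is a hosted URL or a base64 data URL produced by the helper above.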

Licensing and enterprise use

GLM‑4.6V and GLM‑4.6V‑Flash are distributed under the MIT License, a permissive open-source license that permits free commercial and non-commercial use, modification, redistribution, and local deployment without obligation to create open-source derivative works.

This licensing model makes the series suitable for enterprise adoption, including scenarios that require full control over infrastructure, compliance with internal governance, or air-gapped environments.

Model weights and documentation are publicly hosted on Hugging Face, with supporting code and tooling available on GitHub.

The MIT license ensures maximum flexibility for integration into proprietary systems, including internal tools, production pipelines, and edge deployments.

Architecture and Technical Capabilities

The GLM-4.6V models follow a traditional encoder-decoder architecture with significant optimization for multimodal inputs.

Both models include a vision transformer (ViT) encoder based on AIMv2-Huge and an MLP projector to align visual features with a large language model (LLM) decoder.

Video inputs benefit from 3D convolution and temporal compression, while spatial encoding is handled using bicubic interpolation of 2D-RoPE and full positional embeddings.

A key technical feature is the system’s support for arbitrary image resolutions and aspect ratios, including wide panoramic inputs up to 200:1.

In addition to static image and document parsing, GLM-4.6V can ingest temporal sequences of video frames with explicit timestamp tokens, enabling robust temporal reasoning.
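The timestamped-frame idea can be sketched as follows. Only the interleaving pattern comes from the article; the token strings themselves are assumptions for illustration, not GLM-4.6V's actual vocabulary.

```python
def frames_to_token_stream(timestamps, frame_token="<frame>"):
    """Interleave explicit timestamp tokens with per-frame placeholders,
    illustrating how timestamped video input can enable temporal reasoning.
    Token strings are hypothetical; only the pattern is from the article."""
    stream = []
    for t in timestamps:
        stream.append(f"<ts:{t:.1f}>")  # hypothetical timestamp token
        stream.append(frame_token)      # stands in for the frame's visual tokens
    return stream

# Three frames sampled at 0.0s, 0.5s, and 1.0s:
tokens = frames_to_token_stream([0.0, 0.5, 1.0])
print(tokens)
```

With timestamps in the stream, the decoder can answer questions like "what happened between 0.5s and 1.0s?" by attending to the tokens between those markers.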

On the decoding side, the model supports token generation aligned with function-calling protocols, allowing structured reasoning across text, image, and tool outputs. This is supported by an extended tokenizer vocabulary and output-formatting templates that ensure consistent API and agent compatibility.

Using Native Multimodal Tools

GLM-4.6V introduces native multimodal function calling, allowing visual assets – such as screenshots, images, and documents – to be passed directly as tool parameters. This eliminates the need for intermediate text-only conversions, which have historically created information loss and complexity.

The tool-invocation mechanism works bi-directionally:

  • Input tools can pass images or videos directly (for example, to crop or analyze document pages).

  • Output tools such as chart renderers or web snapshot utilities return visual data, which GLM-4.6V integrates directly into the reasoning chain.

In practice, this means that GLM-4.6V can:

  • Generate structured reports from mixed-format documents

  • Visually audit candidate images

  • Automatically crop figures from papers during generation

  • Conduct visual web searches and answer multimodal queries
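The workflow above can be sketched as a tool definition plus an executor whose result is an image rather than text. Every name and field in this schema is hypothetical – it illustrates the idea of multimodal tool I/O, not Z.ai's actual protocol.

```python
# Hypothetical tool schema: a cropping tool whose argument is an image
# reference, in the JSON-schema style used by OpenAI-compatible APIs.
crop_tool = {
    "type": "function",
    "function": {
        "name": "crop_image",
        "description": "Crop a region out of an input image.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_url": {"type": "string"},
                "box": {
                    "type": "array",
                    "items": {"type": "integer"},
                    "description": "x1, y1, x2, y2 in pixels",
                },
            },
            "required": ["image_url", "box"],
        },
    },
}

def handle_tool_call(name: str, arguments: dict) -> dict:
    """Pretend executor: returns an image reference the model can consume
    directly on the next turn, instead of a lossy text description."""
    if name == "crop_image":
        x1, y1, x2, y2 = arguments["box"]
        return {
            "type": "image_url",
            "image_url": {"url": arguments["image_url"] + f"#crop={x1},{y1},{x2},{y2}"},
        }
    raise ValueError(f"unknown tool: {name}")

result = handle_tool_call(
    "crop_image",
    {"image_url": "https://example.com/page.png", "box": [0, 0, 100, 50]},
)
print(result["image_url"]["url"])
```

The key point is the return type: because the tool result is itself visual, the model can keep reasoning over pixels rather than over a textual summary of them.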

Strong benchmark performance against similarly sized models

GLM-4.6V was evaluated on over 20 public benchmarks covering general VQA, chart understanding, OCR, STEM reasoning, frontend replication, and multimodal agents.

According to the benchmark chart released by Zhipu AI:

  • GLM-4.6V (106B) achieves SoTA or near-SoTA scores among open-source models of comparable size (106B) on MMBench, MathVista, MMLongBench, ChartQAPro, RefCOCO, TreeBench and others.

  • The GLM-4.6V-Flash (9B) outperforms other lightweight models (for example, Qwen3-VL-8B, GLM-4.1V-9B) in almost all categories tested.

  • The 128K-token window of the 106B model allows it to outperform larger models such as Step-3 (321B) and Qwen3-VL-235B on long-context document tasks, video summaries, and structured multimodal reasoning.

Example scores from the leaderboard include:

  • MathVista: 88.2 (GLM-4.6V) vs 84.6 (GLM-4.5V) vs 81.4 (Qwen3-VL-8B)

  • WebVoyager: 81.0 vs 68.4 (Qwen3-VL-8B)

  • Ref-L4 (test): 88.9 vs 89.5 (GLM-4.5V), but with better grounding fidelity at 87.7 (Flash) vs 86.8

Both models were evaluated using the vLLM inference backend, with SGLang support for video-based tasks.

Frontend Automation and Long-Context Workflows

Zhipu AI emphasized the ability of GLM-4.6V to support frontend development workflows. The model can:

  • Replicate pixel-precise HTML/CSS/JS from UI screenshots

  • Accept natural language editing commands to modify layout

  • Visually identify and manipulate specific UI components

This capability is integrated into an end-to-end visual programming interface, where the model iterates over layout, design intent, and output code using its native understanding of screen captures.

In long-document scenarios, GLM-4.6V can process up to 128,000 tokens – enough for a single inference pass to cover:

  • 150 pages of text (input)

  • A 200-slide deck

  • One hour of video
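Rough arithmetic makes these limits concrete. The per-unit token estimates below are illustrative assumptions, not figures from Zhipu AI.

```python
# Back-of-the-envelope check that the listed workloads fit in a 128K window.
# Per-unit token estimates are illustrative assumptions, not Zhipu AI's numbers.
CONTEXT_TOKENS = 128_000

tokens_per_page = 500           # a dense page of prose, rough average
tokens_per_slide = 300          # a text-light slide, rough average
tokens_per_video_minute = 1800  # ~1 sampled frame/s at ~30 tokens per frame

print(150 * tokens_per_page)         # tokens for 150 pages of text
print(200 * tokens_per_slide)        # tokens for a 200-slide deck
print(60 * tokens_per_video_minute)  # tokens for one hour of video

# Under these assumptions, each workload fits inside the context window.
assert 60 * tokens_per_video_minute <= CONTEXT_TOKENS
```

Real token counts depend heavily on the tokenizer and the visual sampling rate, so treat these as order-of-magnitude estimates.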

Zhipu AI reported successful use of the model for summarizing full-length sports broadcasts, for financial analysis, and for timestamped event detection across multi-document corpora.

Training and Reinforcement Learning

The model was trained with multi-stage pre-training followed by supervised fine-tuning (SFT) and reinforcement learning (RL). Major innovations include:

  • Reinforcement Learning with Curriculum Sampling (RLCS): dynamically adjusts the difficulty of training samples based on model progress

  • Multi-domain reward system: task-specific validators for STEM, chart reasoning, GUI agents, video QA, and spatial grounding

  • Function-aware training: uses structured tags (e.g., <think>, <answer>, <|begin_of_box|>) to align tool arguments and answer formatting

The reinforcement learning pipeline emphasizes verifiable rewards (RLVR) over human feedback (RLHF) for scalability, and avoids KL/entropy loss to stabilize training in multimodal domains.
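A verifiable reward is simply a programmatic checker, which is what makes RLVR scale without human labelers. Below is a minimal sketch built around the boxed-answer tag mentioned above; the closing tag name is an assumption made for this illustration.

```python
import re

def boxed_answer_reward(output: str, gold: str) -> float:
    """Minimal verifiable-reward (RLVR) checker: score 1.0 iff the model's
    final answer, delimited by the <|begin_of_box|> tag, exactly matches a
    reference. The closing tag name is an assumption for this sketch."""
    m = re.search(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", output, re.S)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold.strip() else 0.0

print(boxed_answer_reward("<think>...</think><|begin_of_box|>42<|end_of_box|>", "42"))  # 1.0
print(boxed_answer_reward("the answer is 42", "42"))  # 0.0 (no box, no reward)
```

The same pattern generalizes to the other reward domains the article lists: a chart-reasoning validator might compare extracted numbers, and a GUI-agent validator might check the final screen state.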

Pricing (API)

Zhipu AI offers competitive pricing for the GLM-4.6V series, with both the flagship model and its lightweight version positioned for high accessibility.

  • GLM-4.6V: $0.30 (input) / $0.90 (output) per 1M tokens

  • GLM-4.6V-Flash: Free
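As a quick sanity check on these rates, here is a hedged worked example; the workload sizes are made up for illustration.

```python
# Cost of a hypothetical workload at GLM-4.6V's listed per-1M-token rates.
INPUT_PER_M = 0.30   # USD per 1M input tokens
OUTPUT_PER_M = 0.90  # USD per 1M output tokens

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Total API cost in USD for a given token workload."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# Example: 2M input tokens and 500K output tokens.
total = cost_usd(2_000_000, 500_000)
print(f"${total:.2f}")  # $1.05
```

At these rates, output tokens dominate cost for generation-heavy workloads, which is worth factoring into any comparison with the providers below.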

Compared with leading vision-enabled and text-first LLMs, GLM-4.6V is among the most cost-effective options for large-scale multimodal reasoning. Below is a comparative snapshot of pricing from different providers:

USD per 1M tokens – sorted from lowest to highest total cost:

| Model | Input | Output | Total cost | Source |
|---|---|---|---|---|
| Qwen3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 | Qianfan |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| DeepSeek-Chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| DeepSeek-Reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| GLM‑4.6V | $0.30 | $0.90 | $1.20 | Z.ai |
| Qwen3 Plus | $0.40 | $1.20 | $1.60 | Alibaba Cloud |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan |
| Qwen-Max | $1.60 | $6.40 | $8.00 | Alibaba Cloud |
| GPT-5.1 | $1.25 | $10.00 | $11.25 | OpenAI |
| Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 | Google |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 | Google |
| Grok 4 (0709) | $3.00 | $15.00 | $18.00 | xAI |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.1 | $15.00 | $75.00 | $90.00 | Anthropic |

Previous Release: GLM‑4.5 Series and Enterprise Applications

Prior to GLM‑4.6V, Z.ai released the GLM‑4.5 family in mid-2025, establishing the company as a serious contender in open-source LLM development.

The flagship GLM‑4.5 and its smaller sibling GLM‑4.5‑Air both support reasoning, tool use, coding, and agentic behavior, offering strong performance on standard benchmarks.

The model introduced dual reasoning modes (“thinking” and “non-thinking”) and could automatically generate entire PowerPoint presentations from a single prompt – a feature aimed at enterprise reporting, education, and internal comms workflows. Z.ai also extended the GLM‑4.5 series with additional variants such as GLM‑4.5‑X, AirX, and Flash, targeting ultra-fast inference and low-cost scenarios.

Together, these features position the GLM‑4.5 Series as a cost-effective, open, and production-ready option for enterprises requiring autonomy over model deployment, lifecycle management, and integration pipelines.

Ecosystem Implications

The GLM-4.6V release represents significant progress in open-source multimodal AI. While large vision-language models have proliferated over the past year, few offer:

  • Integrated visual tool use

  • Structured multimodal generation

  • Agent-oriented memory and decision reasoning

Zhipu AI’s emphasis on “closing the loop” from perception to action through native function calling is a step toward agentic multimodal systems.

The model’s architecture and training pipeline reflect the continued evolution of the GLM family, positioning it competitively with offerings such as OpenAI’s GPT-4V and Google DeepMind’s Gemini-VL.

Takeaways for Enterprise Leaders

With GLM-4.6V, Zhipu AI introduces an open-source VLM capable of native visual tool use, long-context reasoning, and frontend automation. It sets new performance benchmarks among similarly sized models and provides a scalable platform for building agentic, multimodal AI systems.


