Google releases Gemma 4 under Apache 2.0 — and that license change may matter more than benchmarks

For the past two years, enterprises evaluating open-weight models have faced a strange trade-off. Google's Gemma line delivered consistently strong performance, but its custom license, with usage restrictions and terms that Google could update at will, pushed many teams toward Mistral or Alibaba's Qwen. Legal review added friction. Compliance teams flagged edge cases. And as capable as Gemma 3 was, "open with an asterisk" was not the same as open.

Gemma 4 eliminates that friction completely. Google DeepMind's latest open model family ships under the standard Apache 2.0 license, the same permissive terms used by Qwen, Mistral, Arcee and most of the open-weight ecosystem.

No custom clauses, no "harmful use" carve-outs that require legal interpretation, no restrictions on redistribution or commercial deployment. For enterprise teams that have been waiting for Google to play on the same licensing terms as the rest of the sector, the wait is over.

The timing is remarkable. Just as some Chinese AI labs (notably Alibaba with its latest Qwen models, Qwen3.5 Omni and Qwen3.6 Plus) have begun stepping back from fully open releases, Google is moving in the opposite direction, opening up its most capable Gemma release to date while stating clearly that the architecture is derived from its commercial Gemini 3 research.

Four models, two tiers: from edge to workstation in one family

Gemma 4 arrives as four models organized into two deployment tiers. The "workstation" tier includes a 31B-parameter dense model and a 26B-A4B mixture-of-experts model, both supporting text and image input with a 256K-token context window. The "edge" tier includes E2B and E4B, compact models designed for phones, embedded devices and laptops, supporting text, image and audio with a 128K-token context window.

The naming conventions require some unpacking. The "E" prefix denotes "effective parameters": E2B has 2.3 billion effective parameters but 5.1 billion total, because each decoder layer keeps its own small embedding table through a technique Google calls Per-Layer Embedding (PLE). These tables are large on disk but cheap to compute, which is why the model runs like a 2B model while technically being much heavier.

"A" A4B means in 26B "active parameters" – Of the 25.2 billion total parameters of the MoE model, only 3.8 billion are activated during inference, meaning it provides about 26B class of intelligence with a computation cost comparable to the 4B model.

For IT leaders sizing GPU requirements, this translates directly into deployment flexibility. The MoE model can run on consumer-grade GPUs and should land quickly in tools like Ollama and LM Studio. The 31B dense model requires more headroom, think NVIDIA H100 or RTX 6000 Pro for unquantized inference, but Google is also shipping quantization-aware training (QAT) checkpoints to maintain quality at low precision. On Google Cloud, both workstation models can now run in a fully serverless configuration on Cloud Run with NVIDIA RTX Pro 6000 GPUs, scaling to zero when idle.
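To see why the 31B dense model needs H100-class headroom unquantized while QAT checkpoints bring it within reach of smaller cards, here is a rough lower-bound estimate of the memory the weights alone consume at different precisions. Real deployments need extra room for the KV cache and activations, so treat these as floors, not requirements.

```python
# Rough VRAM floor for model weights at different precisions.
# Illustrative only: excludes KV cache, activations, and runtime overhead.

def weight_vram_gb(params: float, bits_per_param: int) -> float:
    """Memory for weights alone, in decimal GB."""
    return params * bits_per_param / 8 / 1e9

PARAMS_31B = 31e9
for bits, label in [(16, "bf16"), (8, "int8"), (4, "int4 (QAT)")]:
    print(f"{label:>10}: ~{weight_vram_gb(PARAMS_31B, bits):.0f} GB")
# bf16 lands around 62 GB (H100 territory); int4 around 16 GB.
```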

The MoE bet: 128 small experts to cut inference costs

The architectural choices inside the 26B-A4B model deserve particular attention from teams evaluating inference economics. Rather than following the pattern of recent large MoE models that use a handful of large experts, Google adopted 128 small experts, activating eight per token plus one shared always-on expert. The result is a model that benchmarks competitively with dense models in the 27B-31B range while running at roughly the speed of a 4B model during inference.
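The routing step behind that configuration can be sketched in a few lines. This is a generic top-k MoE router under the numbers described above (128 experts, 8 active per token, shared expert handled outside the router); function names and details are my own illustration, not Gemma 4's actual implementation.

```python
import math
import random

N_EXPERTS = 128   # small routed experts (article figure)
TOP_K = 8         # experts activated per token (article figure)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits):
    """Select the top-k experts for one token and renormalize their gate
    weights. The shared always-on expert runs unconditionally, outside
    this router, so it does not appear here."""
    probs = softmax(router_logits)
    top = sorted(range(N_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

random.seed(0)
gates = route_token([random.gauss(0, 1) for _ in range(N_EXPERTS)])
print(len(gates), round(sum(gates.values()), 6))  # 8 experts, gates sum to 1.0
```

Many small experts give the router finer-grained specialization than a few large ones, at the same active-parameter budget.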

This isn't just a benchmark curiosity; it directly impacts serving costs. A model that delivers 27B-class reasoning at 4B-class throughput means fewer GPUs, lower latency, and cheaper per-token inference in production. For organizations running coding assistants, document processing pipelines, or multi-turn agentic workflows, the MoE variant may be the most practical choice in the family.

Both workstation models use a hybrid attention system that combines local sliding-window attention with full global attention; the last layer is always global. This design enables a 256K context window while keeping memory consumption manageable, an important consideration for teams processing long documents, codebases or multi-turn agent conversations.
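The memory saving comes from most layers seeing only a short window of recent tokens. A minimal sketch of the two attention patterns, with an illustrative window size (the real window size is not stated in this article):

```python
# Sketch of hybrid attention: local layers see a sliding window of
# recent tokens; global layers (including the last) see the whole
# causal prefix. Window size here is illustrative only.

def visible_positions(layer_is_global: bool, query_pos: int, window: int = 4):
    """Which earlier positions a query token can attend to (causal)."""
    if layer_is_global:
        return list(range(query_pos + 1))        # full causal prefix
    start = max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))     # sliding window

print(visible_positions(False, 10))       # [7, 8, 9, 10]
print(len(visible_positions(True, 10)))   # 11
```

Because local layers' KV cache grows only to the window size rather than the full context, the bulk of the 256K-context memory cost is confined to the global layers.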

Native multimodality: vision, audio, and function calling built from scratch

Previous generations of open models generally treated multimodality as an add-on. Vision encoders were bolted onto the text backbone. Audio required an external ASR pipeline like Whisper. Function calling relied on prompt engineering and the hope that the model would cooperate. Gemma 4 integrates all of these capabilities at the architectural level.

All four models handle variable aspect-ratio image input with a configurable visual token budget, a meaningful improvement over Gemma 3n's older vision encoder, which struggled with OCR and document understanding. The new encoder supports budgets ranging from 70 to 1,120 tokens per image, allowing developers to trade detail against compute depending on the task.

Low budgets work for classification and captioning; higher budgets handle OCR, document parsing and fine-grained visual analysis. Multi-image and video inputs (processed as frame sequences) are natively supported, enabling visual reasoning across multiple documents or screenshots.
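In practice the trade-off above becomes a per-request knob. A hypothetical helper mapping task types to budgets inside the stated 70-1,120 range; the task names and specific budget values are my own assumptions, not part of any Gemma 4 API.

```python
# Hypothetical budget selector for the 70-1,120 visual-token range the
# article describes. Task names and exact values are illustrative.

TASK_BUDGETS = {
    "classification": 70,        # coarse labels need little detail
    "captioning": 256,
    "document_parsing": 1120,
    "ocr": 1120,                 # dense text needs the full budget
}

def visual_token_budget(task: str) -> int:
    budget = TASK_BUDGETS.get(task, 256)   # middling default for unknown tasks
    return max(70, min(1120, budget))      # clamp to the supported range

print(visual_token_budget("ocr"))             # 1120
print(visual_token_budget("classification"))  # 70
```

The point is simply that image cost is no longer fixed: a screenshot-classification pipeline can pay 16x fewer visual tokens per image than an OCR pipeline using the same model.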

The two edge models add native audio processing, automatic speech recognition and speech-to-text translation, entirely on device. The audio encoder has been compressed from Gemma 3n's 681 million parameters to 305 million, while the frame duration has been cut from 160ms to 40ms for more responsive transcription. For teams building voice-first applications that need to keep data local, think healthcare, field service, or multilingual customer contact, running ASR, translation, reasoning, and function calling in a single model on a phone or edge device is a real architectural simplification.
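The frame-duration change is worth quantifying: shorter frames mean the encoder can emit transcription updates more often. Simple arithmetic from the figures above:

```python
# How much more often the audio encoder can update its transcript,
# based purely on the frame durations quoted above.

OLD_FRAME_MS = 160   # Gemma 3n
NEW_FRAME_MS = 40    # Gemma 4 edge models

updates_per_second_old = 1000 / OLD_FRAME_MS   # 6.25 updates/s
updates_per_second_new = 1000 / NEW_FRAME_MS   # 25 updates/s
print(updates_per_second_new / updates_per_second_old)  # 4.0x more responsive
```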

All four models also have native function calling, built on Google's FunctionGemma research released late last year. Unlike previous approaches, which relied on instruction-following to coax models into structured tool use, Gemma 4's function calling was trained into the models from the ground up, optimized for multi-turn agentic flows with multiple tools. This shows up in agentic benchmarks, but more importantly, it reduces the prompt engineering overhead that enterprise teams typically invest when building tool-using agents.
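To make "structured tool use" concrete: most function-calling stacks declare tools with JSON-schema-style descriptions and expect the model to emit a parseable call rather than free text. The declaration format and field names below are generic illustrations, not Gemma 4's confirmed wire format.

```python
import json

# Generic JSON-schema-style tool declaration; the field names are
# illustrative, not Gemma 4's actual API.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# A natively trained function-calling model emits a structured call
# instead of prose, e.g. something shaped like:
model_output = '{"tool": "get_weather", "arguments": {"city": "Berlin"}}'
call = json.loads(model_output)
print(call["tool"], call["arguments"]["city"])  # get_weather Berlin
```

When the model is trained for this from the start, the harness no longer needs elaborate prompt scaffolding to coax out valid JSON, which is exactly the overhead reduction the article describes.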

Benchmarks in context: Where Gemma 4 stacks up in a crowded field

The benchmark numbers tell a clear story of generational improvement. The 31B dense model scores 89.2% on AIME 2026 (a rigorous mathematical reasoning test) and 80.0% on LiveCodeBench v6, and hits a Codeforces Elo of 2,150, numbers that not long ago were the exclusive territory of proprietary models. On vision, it reaches 76.9% on MMMU-Pro and 85.6% on MATH-Vision.

For comparison, the Gemma 3 27B scored 20.8% on AIME and 29.1% on LiveCodeBench.

The MoE model tracks closely: 88.3% on AIME 2026, 77.1% on LiveCodeBench, and 82.3% on GPQA Diamond, a graduate-level science reasoning benchmark. The performance gap between the MoE and dense variants is modest given the MoE architecture's significant inference cost advantage.

The edge models punch above their weight class. E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench, strong for a model running on a T4 GPU. E2B, smaller still, manages 37.5% and 44.0% respectively. Despite being a fraction of the size, both outperform the Gemma 3 27B on most benchmarks, thanks to built-in reasoning capability.

These figures need to be read against an increasingly competitive open-weight landscape. Qwen 3.5, GLM-5, and Kimi K2.5 all compete aggressively in this parameter range, and the field moves fast. What sets Gemma 4 apart is less any single benchmark and more the combination: robust reasoning, native multimodality across text, vision and audio, function calling, 256K context, and a truly permissive license, all in a single model family with deployment options from edge devices to cloud serverless.

What should enterprise teams look forward to?

Google is releasing both pre-trained base models and instruction-tuned variants, which is important for organizations planning to fine-tune for specific domains. Gemma base models have historically been a strong basis for custom training, and the Apache 2.0 license now removes any ambiguity about whether fine-tuned derivatives can be deployed commercially.

The serverless deployment option via Cloud Run with GPU support is worth investigating for teams that need inference capacity that scales to zero. Paying only for actual computation during inference, rather than maintaining always-on GPU instances, can meaningfully change the economics of deploying open models in production, especially for internal tools and low-traffic applications.
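A toy comparison shows why scale-to-zero matters for low-traffic workloads. The hourly rate below is a placeholder chosen for round numbers, not a quoted Google Cloud price; only the shape of the comparison is the point.

```python
# Illustrative scale-to-zero economics. The rate is a placeholder,
# NOT an actual Google Cloud price.

GPU_HOURLY_RATE = 3.00    # hypothetical $/hour for one GPU
HOURS_PER_MONTH = 730

def monthly_cost(busy_hours_per_day: float, serverless: bool) -> float:
    if serverless:
        return busy_hours_per_day * 30 * GPU_HOURLY_RATE  # pay only while serving
    return HOURS_PER_MONTH * GPU_HOURLY_RATE              # always-on instance

# An internal tool that is busy about one hour a day:
print(monthly_cost(1, serverless=True))   # 90.0
print(monthly_cost(1, serverless=False))  # 2190.0
```

At this (hypothetical) duty cycle the always-on instance costs over 20x more, which is the gap scale-to-zero closes; for high-traffic services the two converge and reserved capacity usually wins.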

Google has hinted that this may not be the entire Gemma 4 family, with additional model sizes likely to come. But the combination available today, workstation-class reasoning models and edge-class multimodal models, all under Apache 2.0, all drawing from Gemini 3 research, represents Google's most complete open model release to date. For enterprise teams that have been waiting for Google's open models to compete on licensing terms as well as performance, evaluation can finally begin without a call to legal.


