Gemma 4, Decoded: Why Google Released It Free and How It Actually Works
Google released Gemma 4 under Apache 2.0 on April 2, 2026. The license change is the real story, not the benchmarks. This article covers the three-tier framework for 'open' AI (closed, open-weight, open-source), a technical breakdown of how Gemma 4's MoE and multimodal pipeline work, and a practitioner decision flow for picking the right tier.
The Headline That Got the Story Wrong
I will admit something before getting into the framework. Until I sat down to research Gemma 4, I was sloppy with the terms “open source” and “open weight.” I used them interchangeably. So does most of the AI press. So does most of the vendor marketing. Gemma 4’s release under Apache 2.0 finally forced me to read the licenses, the OSI’s definition, and what each major lab actually releases. The difference between those two terms turned out to be more consequential than the convenient shorthand suggested. This article is the version of that research I wish I had read first.
On April 2, 2026, Google released Gemma 4 in four sizes and called them its “most capable open models to date.” The coverage that followed focused, predictably, on the benchmark numbers. Gemma 4’s 31B dense model scores 89.2 on AIME 2026 math, 80.0 on LiveCodeBench, and a 2150 Codeforces ELO, numbers that put it ahead of Llama 4 on several axes and within reach of closed frontier models on others.
The benchmarks are not the story. The story is the license.
Gemma 1, 2, and 3 shipped under a custom “Gemma Terms of Use” that gave Google the unilateral right to “restrict usage” and included a Prohibited Use Policy covering financial, legal, medical, and other sensitive domains. Enterprise legal teams routinely flagged that license as ambiguous. Gemma 4 ships under standard Apache 2.0. No custom clauses. No harmful-use carve-outs. No MAU thresholds. No “Google reserves the right to turn this off” language. Nathan Lambert, writing in Interconnects, put it plainly: “I will personally be so happy if the horrible Llama licenses and Gemma terms of service were an ~18-month transient dynamic of the industry being nervous about releasing strong open models.”
VentureBeat’s coverage agreed. The license change may matter more than the benchmarks.
For a Data or AI leader deciding what to build on, three questions follow. What does “open” actually mean in 2026, now that every major lab has a version of it? What does Gemma 4 do under the hood that justifies building on it rather than the alternatives? And why would Google, which spends nine figures to train frontier models, give one away?
This article answers all three. Section one is the framework (three tiers of “open”). Section two is the architecture (how Gemma 4 works). Section three is the strategy (why every major lab is now giving models away). Section four is the practitioner’s decision flow. You should finish this article able to read any “we’re releasing an open model” announcement in a single pass and know what is actually being offered.
Part 1: The Three Tiers of “Open” AI
The word “open” has been stretched so far in AI marketing that it means almost nothing. A model that releases only its weights is called “open.” A model that releases weights with a 700 million monthly active user cap is called “open.” A model that releases weights, training code, training data, and an OSI-approved license is also called “open.” These three things are not the same.
In October 2024, the Open Source Initiative published version 1.0 of its Open Source AI Definition after a two-year global consultation. It drew a hard line. To qualify as open-source AI, a model has to clear three bars at once: enough information about the training data that a skilled practitioner could build a “substantially equivalent system,” the complete training and inference code under an OSI-approved license, and the model parameters themselves under OSI-approved terms. TechCrunch’s coverage put the gap in concrete terms: Meta still calls Llama “open source” despite the OSI’s objection to its 700M MAU clause, while Google and Microsoft agreed to drop the term for models that do not fully meet the definition.
That leaves three tiers.
Closed-source (API-only). Weights are not released. The model is served via paid API. You do not know what it was trained on. You cannot self-host. Examples in April 2026: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro. Pricing runs from $2 per million input tokens on the low end (Gemini 3.1 Pro) to $75 per million output tokens at the top (Claude Opus 4.7). The frontier stays here.
Open-weight. Weights are downloadable. Training data and training code are not. License varies dramatically. At the permissive end: Apache 2.0 (Gemma 4, Qwen 3.5, Mistral Small 4, Mistral Large 3) or MIT (DeepSeek V3.2). At the restrictive end: Llama 4’s Community License, which caps commercial use at 700 million monthly active users, restricts multimodal rights for EU-domiciled entities, and carries an Acceptable Use Policy with enumerated prohibited categories. Self-hosting is possible. Full reproduction is not.
Open-source (OSI-compliant). Weights, training code, data documentation, and an OSI-approved license. The only mainstream example at frontier-adjacent scale is the OLMo family from the Allen Institute for AI, which ships models, training code, evaluation suite, and training data under Apache 2.0. OLMo 3.1 includes 32B checkpoints (Think and Instruct variants). Pythia and BLOOM occupy the same category at smaller scale. The OSI has publicly confirmed OLMo meets the definition. These models are rare precisely because releasing training data is legally and operationally expensive, not because it is impossible.
The agent-era restatement. When you build an AI product on a foundation model, you are not just renting capability. You are renting a license, inheriting a data-provenance story, and committing to a vendor’s definition of “acceptable use.” Reading the license is the same skill as reading the data contract. If you cannot tell which tier your model sits in, you do not know what you are committed to.
The three tiers matter for four concrete reasons. Reproducibility: only OSI-open-source models can be rebuilt from scratch. Auditability: only OSI-open-source models let you inspect training data for PII, copyright, or bias. Regulatory posture: the EU AI Act’s General-Purpose AI provisions, which begin enforcement August 2, 2026, grant open-weight and open-source providers partial exemptions when weights are public and commercial use is permitted, but still require training-data summaries for providers of models with systemic risk. Strategic lock-in: if a vendor changes license terms on the next release (Google just did the opposite, toward more permissive), your pipeline is the only continuity.
A quick vocabulary table for the rest of this article:
| Tier | What’s released | Examples | Key license examples |
|---|---|---|---|
| Closed | Nothing | GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro | Proprietary API only |
| Open-weight | Weights only | Gemma 4, Llama 4, Qwen 3.5, Mistral, DeepSeek | Apache 2.0, MIT, Llama Community, custom |
| Open-source (OSI) | Weights + code + data info | OLMo 3, Pythia, BLOOM | Apache 2.0 across all components |
Gemma 4 sits firmly in tier two. The license upgrade to Apache 2.0 moves it from the restrictive end of tier two to the permissive end of tier two. It does not move it into tier three, because Google has not released Gemma 4’s training data or data-preparation code.
Part 2: How Gemma 4 Actually Works
The Gemma 4 release is not one model; it is four, each tuned for a different deployment envelope. Understanding what each one is optimized for matters more than memorizing parameter counts.
| Variant | Total params | Active params | Typical use |
|---|---|---|---|
| Gemma 4 E2B | 2.3B effective | 2.3B | Phone, edge, on-device |
| Gemma 4 E4B | 4.5B effective | 4.5B | Laptop, workstation |
| Gemma 4 26B MoE | 26B | 3.8B active | Single-GPU server (with quantization) |
| Gemma 4 31B Dense | 31B | 31B | Fine-tuning foundation, multi-GPU inference |
Context windows: 128K tokens on E2B and E4B, 256K on the 26B and 31B models. Multimodality: text and vision across all four, audio on the edge models (E2B, E4B) only. Native training across 140+ languages with out-of-the-box support for 35+. All four models are decoder-only transformers. All four are distilled from Gemini 3, meaning Google used its closed frontier model as the “teacher” and trained Gemma 4 (the “student”) to match the teacher’s output probability distributions. That distillation is how a 31B parameter model approaches frontier-adjacent quality on several benchmarks: the knowledge sits in the teacher, and the student inherits it at smaller size.
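Google has not published Gemma 4’s exact distillation recipe, but the mechanism described above (training the student to match the teacher’s output probability distributions) is classic soft-label distillation with a KL-divergence loss. A minimal NumPy sketch under that assumption, with toy shapes and the standard temperature-scaled objective:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the vocabulary axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Per-token KL(teacher || student), the textbook soft-label
    distillation objective. Gemma 4's actual recipe is unpublished;
    this is the standard form of the technique."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    log_p_teacher = np.log(p_teacher + 1e-12)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return kl.mean()

# Toy example: 4 token positions, vocabulary of 8
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = rng.normal(size=(4, 8))
loss = distillation_loss(student, teacher)
print(round(float(loss), 4))  # non-negative; zero only when distributions match
```

The gradient of this loss pushes the student toward the teacher’s full distribution over next tokens, which carries far more signal per example than a one-hot label.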
Three architectural choices in Gemma 4 are worth understanding, because they are where Google made non-obvious trade-offs.
Choice 1: Mixture-of-Experts on the 26B Model
The 26B variant is a Mixture-of-Experts (MoE) model. All 26 billion parameters must be loaded into memory, but only about 4 billion are used to compute each token. This is not a new idea (DeepSeek and Qwen have shipped MoE variants for a while), but it is new for the Gemma family.
Here is how the routing works, specifically. For each token, the model produces a hidden state from the attention block. A small learned linear projection (the “router”) scores that hidden state against 128 experts and applies a softmax. The top 8 experts are selected. There is also a shared expert that is always activated, regardless of routing score, which processes every token. The outputs of the shared expert and the 8 routed experts are combined with a weighted sum. The remaining 120 experts are still loaded in GPU memory but do not do any computation on this token.
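The routing above can be sketched in a few lines. This is an illustrative NumPy toy, not Google’s implementation: the expert counts (128 routed, top 8, one always-on shared expert) come from the description above, but the dimensions are made up and each “expert” is a single weight matrix standing in for a full FFN:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 64                      # toy hidden size
N_EXPERTS, TOP_K = 128, 8         # counts from the article

# Router: one small learned linear projection scoring the hidden state
W_router = rng.normal(scale=0.02, size=(D_MODEL, N_EXPERTS))
# Stand-in experts: one weight matrix each instead of a full FFN
experts = rng.normal(scale=0.02, size=(N_EXPERTS, D_MODEL, D_MODEL))
shared_expert = rng.normal(scale=0.02, size=(D_MODEL, D_MODEL))

def moe_layer(hidden):
    """Route one token's hidden state through top-8 of 128 experts."""
    scores = hidden @ W_router                 # (128,) router logits
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax over experts
    top = np.argsort(probs)[-TOP_K:]           # indices of the top-8 experts
    gate = probs[top] / probs[top].sum()       # renormalised gate weights
    routed = sum(g * (hidden @ experts[i]) for g, i in zip(gate, top))
    # The shared expert always fires, regardless of the router's scores
    return routed + hidden @ shared_expert

out = moe_layer(rng.normal(size=D_MODEL))
print(out.shape)  # only 8 of the 128 routed experts did any compute
```

The 120 unselected expert matrices sit in memory untouched, which is exactly the memory-for-compute trade described above.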
The practical consequence: the 26B MoE delivers roughly 4B-parameter latency while holding 26B parameters worth of specialized knowledge. On AWS, a g5.2xlarge instance costs about $1.21 per hour on-demand and is enough to run INT4-quantized Qwen 3.6-35B; Gemma 4 26B MoE fits a similar envelope. The MoE design is what lets Gemma 4 compete with much larger dense models without requiring multi-GPU hosting.
Choice 2: Interleaved Local and Global Attention
All four Gemma 4 models alternate between two types of attention layers in every block. Some layers use local sliding-window attention, where each token only attends to a fixed window (512 tokens on E2B/E4B, 1024 tokens on the 26B and 31B). Other layers use global full-context attention, where each token attends to every other token in the sequence. The model interleaves them.
This is a memory-efficient compromise. Full attention at every layer would blow up memory use at the model’s 256K context. No global attention anywhere would cripple the model’s ability to reason across long spans. The interleaved design reduces KV-cache memory pressure while keeping long-range reasoning.
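A sketch of the two mask types makes the KV-cache saving concrete. The window of 4 below is a toy stand-in for Gemma 4’s 512/1024-token windows:

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Boolean causal mask: True where query i may attend to key j.
    window=None gives global attention; window=W gives sliding-window."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q                   # no attending to future tokens
    if window is None:
        return causal
    return causal & (q - k < window)  # only the last `window` keys

# Toy sequence of 8 tokens, local window of 4
local = attention_mask(8, window=4)
global_ = attention_mask(8)
print(local.sum(), global_.sum())  # local allows far fewer query-key pairs
```

A sliding-window layer only needs to cache the last `window` keys and values per head, so at a 256K context the interleaved design caps the KV-cache growth of most layers while the occasional global layer preserves long-range reasoning.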
A related choice: Gemma 4 uses dual RoPE (Rotary Position Embeddings). Sliding-window layers get standard RoPE. Global layers get a “pruned” RoPE tuned for long contexts. This is the same technique used in Gemini to make 256K+ context windows practical.
Choice 3: Per-Layer Embeddings on Edge Models
The E2B and E4B models add a feature the larger models do not: Per-Layer Embeddings (PLE). For every token, the model produces a small dedicated vector for every layer, not one embedding shared across all layers. The layer-specific vector combines two signals: a token-identity component (from an embedding lookup) and a context-aware component (from a learned projection of the main embedding).
The result is a tiny model (2-4B parameters) that performs better on downstream tasks than its parameter count suggests, because information flows more richly through each layer. Per-Layer Embeddings are why a phone-scale model can punch above its parameter count on practical tasks.
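Google has not published the exact PLE formulation; the NumPy sketch below just encodes the two-signal description above (a per-layer token-identity lookup plus a learned projection of the main embedding), with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, D_PLE, N_LAYERS = 1000, 64, 16, 12  # toy sizes

main_embed = rng.normal(scale=0.02, size=(VOCAB, D_MODEL))
# One small embedding table and one projection per layer (the "PLE" part)
ple_tables = rng.normal(scale=0.02, size=(N_LAYERS, VOCAB, D_PLE))
ple_proj = rng.normal(scale=0.02, size=(N_LAYERS, D_MODEL, D_PLE))

def per_layer_embedding(token_id, layer):
    """Layer-specific vector = token-identity lookup plus a learned
    projection of the shared main embedding (the context-aware signal)."""
    identity = ple_tables[layer, token_id]               # (d_ple,)
    contextual = main_embed[token_id] @ ple_proj[layer]  # (d_ple,)
    return identity + contextual

# One small dedicated vector per layer for a single token
vecs = np.stack([per_layer_embedding(42, l) for l in range(N_LAYERS)])
print(vecs.shape)
```

The per-layer tables can also be kept in slower memory and fetched on demand, which is part of why this design suits memory-constrained edge devices.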
The Multimodal Pipeline
Gemma 4 is not a text-only model. Input can be text in any of 140+ languages, images or video frames, and (on edge models) audio. Each modality flows through its own encoder before the tokens hit the shared transformer backbone.
The vision encoder is a Vision Transformer (ViT) with multidimensional RoPE that preserves aspect ratios and can emit images at variable token budgets: 70, 140, 280, 560, or 1120 tokens per image, depending on how much detail the downstream task needs. A thumbnail classifier does not need 1120 tokens; a chart-reading agent does.
The audio encoder is a Conformer, which is a Transformer encoder augmented with convolutional modules to capture local acoustic structure. Gemma 4’s audio encoder runs on 40ms frames and is 50% smaller than the Gemma 3N equivalent, meaning faster transcription on edge devices.
The tokenizer converts text into subword tokens using SentencePiece, consistent with the Gemini family.
All modality encoders emit tokens into the same sequence that the transformer processes. This is what “natively multimodal” means: there is no separate vision pipeline stitched onto a text model; all modalities share the same attention mechanism and the same distilled Gemini-3 knowledge.
What this looks like in practice. A Gemma 4 26B MoE deployment on a single GPU can classify images, transcribe short audio (E2B/E4B sizes only), answer multilingual questions, and generate text, all in the same inference call. For an enterprise use case like customer support, this collapses what used to be four separate model endpoints into one. The governance question shifts from “how do we manage four models?” to “how do we log and evaluate four modalities going through one model?”
Part 3: Why Google Is Giving This Away
Training a frontier model now costs $78 million to $191 million. Epoch AI estimates that GPT-4 cost about $78 million, Gemini Ultra about $191 million, and Llama 3.1 405B about $170 million. The trajectory points past $1 billion per frontier run by 2027. Google does not spend nine figures on an asset and release it free unless the release is itself a strategy.
Seven motivations, stacked, explain why every major lab now has an open-weight family.
1. Deny a monopoly on the free tier. Meta’s Llama became the default open-weight model through 2024 and 2025. If every enterprise’s on-prem AI strategy defaulted to Llama, Meta (not Google) would capture the ecosystem: fine-tunes, tooling, optimizations, hiring pipelines. Gemma exists so that Llama does not run the open tier unopposed. Microsoft’s Phi family plays a similar role at the small-model end.
2. Funnel developers into Google Cloud. Open-weight users eventually need fine-tuning compute, hosting, evaluation, and monitoring. Google wants those workloads on Vertex AI, on TPUs, on Google Cloud. Gemma is the top of a funnel whose bottom is cloud revenue. The Google Cloud blog explicitly promotes Gemma 4 on Google Cloud infrastructure. The pattern is standard: make the upstream asset free, charge for the downstream infrastructure.
3. Match Chinese permissive licensing. Qwen 3.5 ships Apache 2.0. DeepSeek V3.2 ships MIT. For 18 months, Chinese developers set a new permissive-license bar that Meta’s custom Llama license and Google’s custom Gemma license failed. Nathan Lambert’s core argument about Gemma 4 is that the Apache 2.0 shift is Google responding to that pressure. Staying on a custom license meant ceding the “US-origin, Apache 2.0, frontier-adjacent” slot to no one. Google took that slot.
4. Shape the regulatory narrative. The EU AI Act’s GPAI provisions grant open-weight providers partial exemptions from downstream-documentation obligations when weights are public and commercial use is permitted. Having a serious open-weight family gives Google a seat at the “open is good for safety research” table and a hedge if regulation lands harder on closed frontier models. Open is a regulatory posture, not only a technical one.
5. The Zuckerberg logic: commoditize the complement. Meta’s explicit rationale, articulated by Mark Zuckerberg in his July 2024 post “Open Source AI Is the Path Forward,” is that “selling access to AI models isn’t our business model” and that open-sourcing Llama “doesn’t undercut our revenue.” In his April 2024 interview with Dwarkesh Patel, Zuckerberg argued that Meta’s competitive position lives in “app specific work” built on top of foundation models, not in the foundation models themselves. The strategic logic: AI is a complement to Meta’s ad business, so commoditizing the foundation model destroys competitors’ per-token pricing power without touching Meta’s revenue engine. Every free Llama token is a token OpenAI and Anthropic cannot charge for. Google’s Gemma plays a similar role against OpenAI and Anthropic, though Google’s core business (search + cloud) benefits in a different way from Meta’s (ads + social).
6. Recruiting and research prestige. High-quality open weights signal to ML talent: “we are the lab that shares.” In a market where Meta, Mistral, and DeepSeek compete for the same researchers, the recruiting pipeline matters.
7. Feedback loops and free R&D. Community fine-tunes, quantizations, and evaluation tooling flow back. The open-weight ecosystem does research Google would otherwise fund internally.
| Company | Open-weight model | License | Primary motive |
|---|---|---|---|
| Meta | Llama 4, Llama 5 | Llama Community (700M MAU cap, EU multimodal restriction, AUP) | Commoditize the complement to ads |
| Google | Gemma 4 | Apache 2.0 | Deny Meta’s monopoly, funnel to Vertex/GCP, match Chinese licensing |
| Alibaba | Qwen 3.5, 3.6 | Apache 2.0 (flagship) | Export Chinese AI influence, compete with Llama on license terms |
| DeepSeek | V3.2, V4 | MIT | Attention, recruiting, geopolitical signaling |
| Mistral | Small 4, Large 3 | Apache 2.0 | European sovereignty story |
| Microsoft | Phi family | MIT | Edge/on-device story, complement OpenAI partnership |
| Allen Institute | OLMo 3 | Apache 2.0 + data + code | Research mission, OSI-compliant open-source |
One pattern is worth naming directly: no closed-frontier lab releases its best model. Gemini 3.1 Pro stays closed. GPT-5.4 stays closed. Claude Opus 4.7 stays closed. (Mistral is the partial exception; Mistral Large 3 is Apache 2.0, but Mistral does not compete at the absolute frontier in the same way.) For the labs that own the frontier, open weights are always a tier below the closed flagship. The “we are opening up AI” narrative is precisely calibrated to release the second-best thing at no strategic cost.
Key insight: Altruism is not the default driver, even for genuinely useful releases. The exception that proves the rule is OLMo 3.1: the one family of OSI-compliant open-source models at meaningful scale comes from a nonprofit research institute (Ai2), not from a commercial lab. When a commercial lab releases weights, ask what product the weights are a complement to; that is the revenue model, and the weights are subsidizing it.
Part 4: What a Practitioner Actually Does With This
The practical question is not “is Gemma 4 good?” (it is, on most benchmarks, for its size). The practical question is “for my workload, with my constraints, which tier should I pick, and within that tier, which model?”
Four practitioner scenarios, each with a concrete recommendation.
Scenario A: Customer support agent for a regulated industry (finance or healthcare). Data cannot leave your network. Regulators want to see training-data provenance. Your governance team is likely to demand reproducibility. This is the rare case where OSI-compliant open-source actually matters. Pick OLMo 3.1 32B, accept the benchmark cost (it trails Gemma 4 on raw capability), and gain full auditability. If OLMo’s quality is insufficient, pick Gemma 4 31B under Apache 2.0, document the data-provenance gap explicitly, and build a risk register entry for it.
Scenario B: Developer tool or internal coding assistant, unregulated, 500M+ tokens per month. The cost case for self-hosting beats API pricing at your volume. Pick Gemma 4 26B MoE (Apache 2.0) or Qwen 3.6-35B-A3B (Apache 2.0) on your own GPUs. Qwen currently leads on SWE-bench (73.4% verified) at 3B active parameters, which matters for coding workloads. Gemma 4 leads on competitive-programming metrics (LiveCodeBench 80%, Codeforces 2150 ELO) but trails Qwen on real-world software engineering. Pick by workload, not by brand.
Scenario C: Small startup, need best capability, low initial volume. Self-hosting does not pay below a few hundred million tokens per month. Pick a closed-frontier API (GPT-5.4 at $2.50/$15 per million input/output, Gemini 3.1 Pro at $2/$12, or Claude Opus 4.7 at $15/$75). Treat the API bill as a variable cost and revisit the self-host calculus when volume crosses 300M tokens per month.
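The break-even arithmetic behind Scenarios B and C is worth making explicit. In the sketch below, the GPU price comes from the g5.2xlarge figure quoted earlier; the sustained throughput and the blended API price are assumptions you should replace with your own measurements:

```python
# Break-even sketch: always-on self-hosted GPU vs. metered API.
# Assumptions (hypothetical except where the article states them):
#   - g5.2xlarge at $1.21/hr on-demand (stated earlier)
#   - 1,500 tokens/sec sustained on an INT4-quantized model (assumed)
#   - $5 per million tokens as an illustrative blend of
#     GPT-5.4's $2.50 input / $15 output pricing

GPU_HOURLY = 1.21
TOKENS_PER_SEC = 1_500           # assumed sustained throughput
API_PER_MILLION = 5.00           # illustrative blended $/1M tokens

monthly_gpu_cost = GPU_HOURLY * 24 * 30                 # always-on instance
monthly_capacity = TOKENS_PER_SEC * 3600 * 24 * 30      # tokens per month
self_host_per_million = monthly_gpu_cost / (monthly_capacity / 1e6)

# Volume at which the fixed GPU bill equals the metered API bill
break_even_tokens = monthly_gpu_cost / API_PER_MILLION * 1e6

print(f"self-host: ${self_host_per_million:.3f}/1M tokens at full load")
print(f"break-even: {break_even_tokens / 1e6:.0f}M tokens/month")
```

Under these assumptions the crossover lands well below a billion tokens per month, but real utilization is never 100%, and engineering time for hosting is not in the formula, which is why the article’s rule of thumb sits higher, at a few hundred million tokens.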
Scenario D: On-device AI in a consumer product. Network round-trips are unacceptable. Model must run offline. Pick Gemma 4 E2B or E4B. The Per-Layer Embeddings architecture is specifically designed for this envelope. Audio support (transcription, voice commands) is built in. Verify memory budget on target hardware before committing.
In every scenario, the real gating step happens before you pick the model: you read the license. Llama 4’s 700M MAU cap is live. Llama 4’s EU multimodal restriction is live. Gemma 1-3’s custom license is still live on deployed systems; if you are running Gemma 3 in production, your license is the Gemma Terms of Use, not Apache 2.0. The license travels with the weights you downloaded, not with the model family name.
How to read an open-weight license in five minutes. Four questions. (1) Is the license OSI-approved? Apache 2.0 and MIT are. “Community license” variants from Meta and older Gemma terms are not. (2) Is there a user-threshold cap? Llama’s 700M MAU is the most famous; read carefully. (3) Is there a jurisdiction restriction? Llama 4 excludes EU-domiciled entities from multimodal rights. (4) Can the licensor retract or modify? Older Gemma licenses gave Google unilateral restriction rights. If any answer reveals a restriction you cannot live with, the model is not free for you.
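The four questions reduce to a checklist you can encode in procurement tooling. A minimal sketch; the function, field names, and example entries are illustrative, not a legal taxonomy, and never a substitute for reading the license text:

```python
# OSI-approved permissive licenses seen in the tiers above (partial list)
OSI_APPROVED = {"Apache-2.0", "MIT", "BSD-3-Clause"}

def license_red_flags(license_id, mau_cap=None, jurisdiction_limits=(),
                      licensor_can_retract=False):
    """Return a list of red flags per the four-question checklist."""
    flags = []
    if license_id not in OSI_APPROVED:          # Q1: OSI-approved?
        flags.append("not OSI-approved; read the full text")
    if mau_cap is not None:                     # Q2: user-threshold cap?
        flags.append(f"user-threshold cap at {mau_cap:,} MAU")
    for j in jurisdiction_limits:               # Q3: jurisdiction limits?
        flags.append(f"jurisdiction restriction: {j}")
    if licensor_can_retract:                    # Q4: retraction rights?
        flags.append("licensor can unilaterally restrict usage")
    return flags

# Gemma 4 under Apache 2.0: no flags.
print(license_red_flags("Apache-2.0"))  # → []
# Llama-4-style terms as described above: three flags.
print(license_red_flags("Llama-Community", mau_cap=700_000_000,
                        jurisdiction_limits=["EU multimodal rights excluded"]))
```

If the returned list is non-empty and any entry is one you cannot live with, the model is not free for you.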
What to Do Next
| Priority | Action | Why it matters |
|---|---|---|
| P0 | Read every license currently governing models in production. Map each to one of three tiers (closed, open-weight, OSI). | You cannot manage what you have not classified. The license change in Gemma 4 just demonstrated that terms move; yours could have moved without you noticing. |
| P0 | Write down the three-tier framework as an internal procurement gate. Require that every new foundation-model selection document the tier and the license-risk position. | Turns a one-off decision into a repeatable governance practice. Stops the “open means safe” reflex before it causes contractual pain. |
| P1 | If you are running Llama 4 in production, verify the 700M MAU cap and the EU multimodal restriction against your business plan. | These are contractual obligations most enterprises have not actually read. If you are nearing the cap or operating in the EU on multimodal, you are exposed. |
| P1 | Evaluate Gemma 4 26B MoE against your current open-weight stack (Llama 4, Qwen 3.5). Compare on your workload’s benchmarks, not on press-release numbers. | The MoE architecture genuinely changes deployment economics; real capability is workload-specific. |
| P2 | Add the EU AI Act General-Purpose AI deadline (August 2, 2026) to your risk register. Decide whether your open-weight models qualify for the exemption. | Open-weight providers who release training-data summaries and meet the systemic-risk threshold have specific obligations; your usage may inherit some of them. |
| P2 | Run one pilot on an OSI-compliant open-source model (OLMo 3.1 32B is the current benchmark-adjacent option) to understand what “full provenance” looks like in practice. | Even if you never ship OLMo, knowing what full auditability costs in capability terms sharpens every future trade-off. |
| P3 | Brief your legal and procurement teams on the three tiers before the next vendor conversation. | Vendors say “open.” Your team should reply with “which tier?” |
The Bottom Line
Google released Gemma 4 free on April 2, 2026, under Apache 2.0. That license change is a genuine shift, and it makes Gemma 4 materially more usable than its predecessors for regulated enterprises. But Gemma 4 is still open-weight, not open-source by the OSI’s definition. Training data and data-preparation code are not released. Google still controls the frontier through Gemini 3.1 Pro, which stays closed and API-gated.
The release is not generosity. It is a calibrated move to deny Meta the open-weight monopoly, match Chinese permissive licensing, funnel developers into Google Cloud, and shape EU AI Act regulatory positioning. Every major lab has a version of this play. Meta’s version is “commoditize the complement.” Alibaba’s version is “export Chinese AI influence through permissive licensing.” Mistral’s version is “European sovereignty.” Microsoft’s version is “edge/on-device.” Only Ai2’s OLMo is genuinely open-source by the OSI definition; it is a nonprofit, and that is not a coincidence.
For a practitioner, the lesson is narrower and more useful. “Open” is three tiers, not one. Read the license. Pick the tier that matches your regulatory, data-residency, volume, and capability constraints, in that order. The free model is only free if its license agrees with the business you are actually running.
Sources & References
- Gemma 4: Byte for byte, the most capable open models (Google blog, 2026)
- Gemma 4: Expanding the Gemmaverse with Apache 2.0 (Google Open Source Blog, 2026)
- VentureBeat: Google releases Gemma 4 under Apache 2.0 and that license change may matter more than benchmarks (2026)
- Nathan Lambert (Interconnects): Gemma 4 and what makes an open model succeed (2026)
- Hugging Face: Welcome Gemma 4, frontier multimodal intelligence on device (2026)
- Gemma 4 model card (Google AI for Developers, 2026)
- Open Source Initiative: Open Source AI Definition v1.0 (2024)
- Open Source Initiative: Open Weights, not quite what you've been told (2024)
- TechCrunch: We finally have an 'official' definition for open source AI (2024)
- Llama 4 Community License (2026)
- Llama 4 Acceptable Use Policy (2026)
- Ai2: OLMo 3, Charting a path through the model flow to lead open-source AI (2026)
- Epoch AI: How much does it cost to train frontier AI models? (2024)
- Mark Zuckerberg: Open Source AI Is the Path Forward (Meta, July 2024)
- Mark Zuckerberg interviewed by Dwarkesh Patel on Llama 3 and open-sourcing $10B models (2024)
- EU AI Act: Guidelines for providers of general-purpose AI models (2025)