
Every frontier AI lab is rationing two things right now: electricity and compute. Most of them buy their training silicon from a single supplier at huge margins, which has turned Nvidia into one of the most valuable companies in the world. Google doesn’t.
On Tuesday night, inside a private gathering at F1 Plaza in Las Vegas, Google previewed its eighth generation of tensor processing units. The pitch: Two custom silicon designs shipping later this year, each purpose-built for a different half of modern AI workloads. The TPU 8t targets training for frontier models, and the TPU 8i targets the low-latency, memory-hungry world of agentic inference and real-time sampling.
Google’s SVP and chief technologist of AI and infrastructure, Amin Vahdat (pictured above left), used his time on stage to make a point that matters more to enterprise buyers than any individual spec: Google designs every layer of its AI stack end to end, and that vertical integration is starting to show up in cost-per-token economics that Google says its rivals can’t match.
"One chip per year was not enough": Inside Google’s 2024 bet on a two-chip roadmap
The more interesting story behind the v8t and v8i is when the decision was made to split the roadmap. According to Vahdat, the call came in 2024 – a year before large-scale reasoning models, agents, and reinforcement learning emerged as defining frontier workloads.
At the time, it was a contrarian call. "We realized two years ago that one chip a year would not be enough," Vahdat said during the fireside chat. "This is actually our first attempt to go with two super high powered specialized chips."
For enterprise buyers, the implication is straightforward. Customers running fine-tuning or large-scale training on Google Cloud while serving production agents on Vertex AI have been renting the same accelerators for both and eating the inefficiency. The v8 generation is the first where the silicon itself treats those as two different problems, with two different chips.
TPU 8t: a training fabric that scales to a million chips
On paper, the TPU 8t is an aggressive generational step forward. According to Google, the 8t delivers 2.8x the FP4 EFLOPS per pod (121 vs. 42.5) of Ironwood, the seventh-generation TPU that shipped in 2025, doubles bidirectional scale-up bandwidth to 19.2 Tb/s per chip, and quadruples scale-out networking to 400 Gb/s per chip. Pod size increases modestly, from 9,216 to 9,600 chips, connected using Google’s 3D torus topology.
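Google’s comparison is easy to sanity-check. A minimal sketch in Python, using only the per-pod figures above; the Ironwood scale-up and scale-out baselines (9.6 Tb/s and 100 Gb/s) are inferred from the stated 2x and 4x multipliers rather than quoted directly.

```python
# Back-of-envelope check of the claimed 8t-over-Ironwood multiples.
# All figures are vendor-reported; the Ironwood bandwidth baselines are inferred.
ironwood = {"fp4_eflops_per_pod": 42.5, "scale_up_tbps": 9.6,
            "scale_out_gbps": 100, "pod_chips": 9216}
tpu_8t = {"fp4_eflops_per_pod": 121, "scale_up_tbps": 19.2,
          "scale_out_gbps": 400, "pod_chips": 9600}

for key in ironwood:
    print(f"{key}: {tpu_8t[key] / ironwood[key]:.2f}x")
# fp4_eflops_per_pod: 2.85x, scale_up_tbps: 2.00x,
# scale_out_gbps: 4.00x, pod_chips: 1.04x
```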
The number that matters most to IT leaders evaluating where to run frontier-scale training: 8t clusters (SuperPods) can scale to more than 1 million TPU chips in a single training job through a new interconnect that Google is calling Virgo Networking.
The 8t also introduces TPU Direct Storage, which moves data from Google’s managed storage tier directly into HBM without the usual CPU-mediated hops. For long training runs, where wall-clock time is the cost driver, collapsing that data path reduces the number of pod-hours required to finish each epoch.
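Google hasn’t published the data-path details, but the wall-clock argument is simple arithmetic. A hedged sketch with entirely hypothetical numbers, illustrating how input-pipeline stalls translate into extra chip-hours per epoch:

```python
# Illustrative only: how input-pipeline stalls turn into chip-hours.
# Every number below is hypothetical, not a Google figure.
pod_chips = 9600
step_time_s = 2.0          # compute time per training step
stall_per_step_s = 0.3     # time spent waiting on CPU-mediated data loading
steps_per_epoch = 50_000

def chip_hours(stall_s: float) -> float:
    wall_clock_s = steps_per_epoch * (step_time_s + stall_s)
    return pod_chips * wall_clock_s / 3600

baseline = chip_hours(stall_per_step_s)  # CPU-hopped data path
direct = chip_hours(0.0)                 # storage-to-HBM path, idealized
print(f"saved per epoch: {baseline - direct:,.0f} chip-hours")
# With these assumptions, removing the stall saves ~40,000 chip-hours per epoch.
```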
TPU 8i and Boardfly: Re-engineering the network for agents
If the 8t is an evolutionary step, the TPU 8i is an architecturally more interesting chip. This is also where the story becomes most compelling for IT buyers.
As Vahdat put it, the generational jump in specs is "astonishing." According to Google, the 8i offers 9.8x the FP8 EFLOPS per pod (11.6 vs. 1.2), 6.8x the HBM capacity per pod (331.8 TB vs. 49.2 TB), and a pod size that grows 4.5x, from 256 to 1,152 chips.
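Dividing the per-pod figures by the pod sizes makes the per-chip picture clearer. A quick sketch using only the numbers above; "prior gen" here is whatever 256-chip baseline Google used in its comparison.

```python
# Per-chip figures derived from the vendor-reported per-pod numbers above.
prior_gen = {"fp8_eflops_per_pod": 1.2, "hbm_tb_per_pod": 49.2, "pod_chips": 256}
tpu_8i = {"fp8_eflops_per_pod": 11.6, "hbm_tb_per_pod": 331.8, "pod_chips": 1152}

for name, pod in (("prior gen", prior_gen), ("TPU 8i", tpu_8i)):
    hbm_gb = pod["hbm_tb_per_pod"] * 1000 / pod["pod_chips"]
    print(f"{name}: ~{hbm_gb:.0f} GB HBM per chip")
# prior gen: ~192 GB per chip; TPU 8i: ~288 GB per chip. The 6.8x pod-level jump
# is roughly 4.5x more chips per pod times ~1.5x more memory per chip.
```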
Behind those numbers is a rethinking of the network. Vahdat explained the insight directly: Google’s default way of linking chips together favors bandwidth over latency – good for moving large volumes of data, not for getting small amounts back with minimal delay. That profile suits training. For agents, it doesn’t. In partnership with Google DeepMind, the TPU team created what Google calls the Boardfly topology specifically to reduce network diameter – minimizing the number of hops between any two chips in a pod. Paired with the Collective Acceleration Engine and what Google describes as very large on-chip SRAM, the 8i delivers a claimed 5x improvement in latency for real-time LLM sampling and reinforcement learning.
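Google hasn’t published Boardfly’s hop counts, but the latency argument follows from network diameter. A rough sketch: the per-hop latency and the two-hop target below are hypothetical assumptions, while the torus formula is the standard worst case for a cubic 3D torus.

```python
# Why network diameter matters for tail latency (illustrative assumptions only).
def torus_3d_diameter(chips: int) -> int:
    # Worst-case hop count in a cubic 3D torus: 3 * floor(side / 2),
    # approximating the pod as a cube of round(chips ** (1/3)) per side.
    side = round(chips ** (1 / 3))
    return 3 * (side // 2)

chips = 1152
per_hop_us = 1.0                       # hypothetical per-hop latency
torus_hops = torus_3d_diameter(chips)  # ~15 hops for a roughly 10x10x10 torus
low_diameter_hops = 2                  # assumed target for a flattened topology

print(f"torus worst case: {torus_hops * per_hop_us:.0f} us")
print(f"low-diameter:     {low_diameter_hops * per_hop_us:.0f} us")
# Fewer worst-case hops is what shows up as lower sampling latency when an agent
# bounces many small tensors between chips on every decoding step.
```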
The vertical-integration gap: why Google doesn’t pay the "Nvidia tax"
The centerpiece of Vahdat’s presentation was a six-layer diagram of what Google calls its AI stack: energy at the foundation, then data center land and enclosures, AI infrastructure hardware, AI infrastructure software, models (Gemini 3), and services on top. Designing each layer separately forces a least-common-denominator approach at every layer, Vahdat said. Google designs them together.
This is where the competitive story sharpens for IT buyers and analysts. OpenAI, Anthropic, xAI, and Meta all rely heavily on Nvidia silicon to train their frontier models. Every H200 and Blackwell GPU they buy carries Nvidia’s data-center gross margin – the unofficial "Nvidia tax" that industry analysts have flagged for two years running as a structural cost disadvantage for anyone renting silicon rather than designing it. Google pays the fab, packaging, and engineering costs on its TPUs. It doesn’t pay that margin.
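The size of that tax can be ballparked. A hedged illustration: the street price and the ~75% gross margin below are assumptions (the margin is roughly in line with Nvidia’s recent company-wide reported figures, not a per-GPU disclosure).

```python
# Illustrative margin math with assumed numbers, not disclosed per-GPU economics.
gpu_street_price = 35_000   # hypothetical per-accelerator price, USD
gross_margin = 0.75         # assumed supplier gross margin

supplier_cost = gpu_street_price * (1 - gross_margin)
margin_paid = gpu_street_price - supplier_cost
print(f"of each ${gpu_street_price:,}, ~${margin_paid:,.0f} is supplier margin")
# A vertically integrated designer still pays fab, packaging, HBM, and engineering
# costs, but not that markup – the cost-per-token wedge Vahdat is pointing at.
```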
What v8 means for the compute race: A new assessment checklist for IT leaders
For procurement and infrastructure teams, TPU v8 reframes the 2026-2027 cloud assessment in concrete ways.
Teams training large proprietary models should focus on 8t availability windows, Virgo Networking access, and goodput SLAs – not just headline EFLOPS. Teams serving agents or reasoning workloads should evaluate the availability of 8i on Vertex AI, the emergence of independent latency benchmarks, and whether the HBM capacity per pod fits their context-window requirements. Teams consuming Gemini through Gemini Enterprise will inherit the 8i lift automatically and should expect the range of what can be deployed in production to grow meaningfully through 2026.
The caveats are real. General availability is still "later in 2026." The v8 is a roadmap signal, not a purchase decision you can make today. Google’s benchmarks are self-reported; independent numbers will no doubt come from early cloud customers and third-party evaluators over the next two quarters. And portability between JAX/XLA and the CUDA/PyTorch ecosystem remains a friction cost worth weighing when negotiating any multi-year commitment.
Looking ahead, Vahdat made two predictions worth noting. First, general-purpose CPUs will see a resurgence inside AI systems – not as accelerators, but as agent sandboxes, virtual machines, and tools that orchestrate computation for execution. Second – framed clearly as an industry prediction rather than a Google roadmap preview – specialization will keep deepening. With general-purpose CPUs improving by only a few percent per year, the workloads that matter will demand purpose-built silicon. "Two chips can be more," Vahdat said – without specifying whether "more" would mean additional TPU variants or other classes of specialized accelerators.
The question in the frontier compute race used to be who could buy the most H100s. Now the question is who controls the stack. At the moment, the short list of companies actually doing that has two names on it: Google and Nvidia.