Research Notes

Google Cloud Next 2026: Google Cloud Bifurcates the AI Future – Specialized TPU 8t and 8i Architectures Signal the End of General-Purpose Silicon

TPU 8t and TPU 8i split training from serving as Google targets 2.8x training gains and 80% inference price-performance improvements over Ironwood; the approach is Google's answer to NVIDIA Vera Rubin NVL72 and AWS Trainium3.

04/22/2026

Key Highlights

  • Google introduced its eighth-generation TPU lineup at Google Cloud Next 2026, marking the first time Google has fielded two truly distinct TPU SKUs in a single generation: TPU 8t, engineered for large-scale training and designed by Broadcom, and TPU 8i, engineered for reasoning and serving and designed by MediaTek.
  • TPU 8t superpods scale to 9,600 chips with 2 petabytes of shared HBM and are designed to deliver up to 2.8x better training price-performance than seventh-generation Ironwood.
  • TPU 8i triples on-chip SRAM to 384 MB, doubles inter-chip interconnect bandwidth to 19.2 Tb/s, and introduces a new Boardfly topology aimed at reducing network diameter by roughly 56% for MoE and reasoning workloads.
  • Both chips are hosted by Google's Axion ARM-based CPUs and support native PyTorch (TorchTPU in preview), JAX, vLLM, SGLang, and bare-metal access, signaling a deliberate effort to reduce framework lock-in perceptions.
  • Our analysis suggests the bifurcation is less about raw FLOPs and more about Google conceding that a single topology cannot efficiently serve both dense training and agent-swarm decoding, a structural admission that reframes the competitive narrative against NVIDIA Vera Rubin NVL72 and AWS Trainium3.

The News

At Google Cloud Next 2026, the company unveiled its eighth generation of Tensor Processing Units. The announcement introduces two purpose-built architectures, TPU 8t for training and TPU 8i for inference and reinforcement learning, co-designed with Google DeepMind and hosted entirely on Google's Axion ARM-based CPUs. While Google has created suffix variants of TPUs in the past, this is new because the chips are distinct from the ground up. The company is pairing the silicon with a new Virgo Network data center fabric, fourth-generation liquid cooling, Google Cloud Managed Lustre advancements, and GKE orchestration enhancements designed for agent-native workloads. Google is positioning the family as the foundation of its AI Hypercomputer for the agentic era, claiming up to 2.8x better training price-performance and 80% better inference price-performance over current seventh-generation Ironwood. Both chips are expected to reach general availability later this year, and additional details are available via the Google Cloud announcement.

Analyst Take

The announcement matters because Google is landing on an answer to a question the industry has been circling for two years: do the economics of agentic AI reward specialization or integration at the silicon layer? Google's answer is in: specialization all the way. By introducing TPU 8t and TPU 8i as distinct systems rather than a single unified accelerator, Google is architecting against the assumption that one chip can simultaneously carry trillion-parameter pre-training and swarm-scale reasoning decode.

Our read is that this move is also a tacit acknowledgment that the seventh-generation Ironwood narrative, which positioned a single chip as the definitive "age of inference" accelerator, understated how different serving and training really are under agentic loads. The DeepMind co-design angle is not cosmetic. With world models like Genie 3 requiring agents to rehearse in simulation, the topology and memory choices in TPU 8i appear deliberately shaped by internal research roadmaps rather than generic benchmark competition.

Google's bifurcation of the TPU v8 line suggests a strategic bet that the general-purpose AI chip is a fading category, replaced by a world where training is a massive batch process and inference is a high-velocity swarm activity. By partnering with MediaTek for the TPU 8i, Google is not just diversifying its supply chain; it is applying mobile-edge efficiency logic (low power, high volume, near-zero latency) to the data center to counter NVIDIA's high-margin, power-hungry dominance.

The move to a Boardfly topology in TPU 8i is a direct architectural admission that for agentic reasoning, the physical distance a data packet travels (network diameter) is now a more critical bottleneck than raw mathematical throughput. Furthermore, the 2 PB unified memory claim for TPU 8t shifts the competitive focus from "Who has the best chip?" to "Who has the best fabric?" as Google attempts to out-engineer NVIDIA's NVLink by making 9,600 chips behave as one coherent brain. From our viewpoint, this specialization-first roadmap implies that Google expects the 2027 AI market to be dominated by agentic swarms that require constant, low-cost imagination rather than just the periodic, massive training runs that defined the early LLM era.

What Was Announced

TPU 8t is architected around Google's proven 3D torus interconnect, scaled to 9,600 chips per superpod with 2 petabytes of unified high-bandwidth memory and double the inter-chip bandwidth of the prior generation, delivering 121 FP4 ExaFLOPs per pod compared with 42.5 FP4 ExaFLOPs on Ironwood. The system introduces native FP4 precision, balanced Vector Processing Unit scaling to reduce exposed non-matrix time, a SparseCore block dedicated to embedding lookups, and TPUDirect RDMA and TPUDirect Storage paths that aim to keep the matrix units saturated during large multimodal training runs. Google says the design targets more than 97% "goodput," with automatic rerouting around failed ICI links and Optical Circuit Switching that reconfigures hardware around failures without operator intervention. The chip itself is designed by Broadcom, which remains Google's long-term partner for its most technically demanding silicon; for TPU 8t, Broadcom handled the complex silicon implementation, high-speed SerDes interconnects, and advanced packaging required for massive-scale training.
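As a sanity check, the pod-level claims above imply straightforward per-chip figures. The chip count, memory pool, and ExaFLOPs values are from the announcement; the decimal unit conventions (1 PB = 10^15 bytes, 1 ExaFLOP = 10^18 FLOPs) are our assumption.

```python
# Back-of-envelope per-chip figures implied by the TPU 8t superpod claims.
# Pod-level numbers come from the announcement; the decimal unit conventions
# are an assumption on our part.

POD_CHIPS = 9_600          # chips per TPU 8t superpod
POD_HBM_BYTES = 2e15       # 2 PB of shared HBM
POD_FP4_FLOPS = 121e18     # 121 FP4 ExaFLOPs per pod

hbm_per_chip_gb = POD_HBM_BYTES / POD_CHIPS / 1e9
fp4_per_chip_pflops = POD_FP4_FLOPS / POD_CHIPS / 1e15

print(f"Implied HBM per chip: {hbm_per_chip_gb:.0f} GB")        # ~208 GB
print(f"Implied FP4 per chip: {fp4_per_chip_pflops:.1f} PFLOPs")  # ~12.6 PFLOPs
```

Roughly 208 GB of HBM and 12.6 FP4 PFLOPs per chip, if the pod-level figures hold.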

TPU 8i is the more architecturally distinct chip. It pairs 288 GB of HBM with 384 MB of on-chip SRAM (roughly 3x the previous generation) so that KV caches for long-context reasoning can sit on silicon rather than spilling to host memory. The die-level tradeoff is notable: TPU 8i carries two Tensor Core on-core dies alongside one new Collectives Acceleration Engine on the chiplet die, physically replacing the four SparseCore blocks that shipped on Ironwood. Google is effectively giving up some embedding acceleration on the serving chip in order to dedicate silicon real estate to the collective reductions that dominate auto-regressive decoding, and the company claims up to a 5x reduction in on-chip collective latency as a result. Pod-level gains are also striking, with TPU 8i pods scaling to 1,152 chips and delivering 11.6 FP8 ExaFLOPs compared with 1.2 FP8 ExaFLOPs on a 256-chip Ironwood pod.
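The pod-level comparison above mixes chip-count growth with per-chip gains; unpacking the stated figures (our arithmetic, using only numbers from the announcement) shows that most of the headline jump comes from the 4.5x larger pod, with per-chip FP8 throughput improving roughly 2.1x.

```python
# Per-chip FP8 throughput implied by the stated pod-level figures, comparing
# a TPU 8i pod with the 256-chip Ironwood pod it is measured against.

tpu8i_pod_ef, tpu8i_chips = 11.6, 1_152      # figures from the announcement
ironwood_pod_ef, ironwood_chips = 1.2, 256

tpu8i_per_chip = tpu8i_pod_ef * 1e18 / tpu8i_chips / 1e15     # PFLOPs
ironwood_per_chip = ironwood_pod_ef * 1e18 / ironwood_chips / 1e15

print(f"TPU 8i per chip:   {tpu8i_per_chip:.2f} FP8 PFLOPs")    # ~10.07
print(f"Ironwood per chip: {ironwood_per_chip:.2f} FP8 PFLOPs")  # ~4.69
print(f"Per-chip ratio:    {tpu8i_per_chip / ironwood_per_chip:.2f}x")  # ~2.15x
```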

Most notably, TPU 8i abandons the 3D torus for a new high-radix Boardfly topology, which the company says cuts the maximum network diameter of a 1,024-chip configuration from 16 hops down to seven, a 56% reduction. In a strategic move to diversify its supply chain and reduce costs, Google partnered with MediaTek for the TPU 8i. Google leveraged MediaTek’s expertise in power-efficient, high-volume mobile SoC design to create a cost-optimized inference chip that is reportedly 20-30% cheaper to produce than traditional high-performance variants.
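The 16-hop baseline is consistent with a conventional wraparound torus; assuming (for illustration, this layout is not in the announcement) a 1,024-chip torus arranged as 16x8x8, the worst-case hop count is the sum of the half-lengths of each dimension, and the claimed Boardfly figure of seven hops works out to the stated 56% reduction.

```python
# Diameter of a 3D torus vs. the claimed Boardfly figure. The 16x8x8
# arrangement of 1,024 chips is our illustrative assumption.

def torus_diameter(dims):
    """Worst-case hop count between any two nodes in a wraparound torus."""
    return sum(d // 2 for d in dims)

torus_hops = torus_diameter((16, 8, 8))   # 8 + 4 + 4 = 16 hops
boardfly_hops = 7                         # figure from the announcement

reduction = (torus_hops - boardfly_hops) / torus_hops
print(f"3D torus diameter: {torus_hops} hops")
print(f"Reduction vs. Boardfly: {reduction:.0%}")   # ~56%
```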

Both chips run on Axion hosts, support native PyTorch via TorchTPU in preview, and introduce bare-metal access to TPUs for the first time.

Market Analysis

The competitive framing is the most interesting part. AWS announced Trainium3 at re:Invent 2025 as a single-SKU, 3nm accelerator aimed at both training and high-end inference, with roughly 2.52 PFLOPs of FP8 compute and 144 GB of HBM3e per chip, and a clear message of convergence between the Trainium and Inferentia lines. NVIDIA, meanwhile, introduced the Vera Rubin NVL72 platform at CES 2026 with 72 Rubin GPUs per rack delivering approximately 3.6 EFLOPs of NVFP4 inference, alongside a context-processing variant (Rubin CPX) intended to offload prefill work. Viewed together, the three hyperscale silicon roadmaps are diverging: AWS converges under one SKU, Google bifurcates across specialized SKUs, and NVIDIA scales up within the rack while scaling out via larger POD configurations. Our read is that each approach optimizes for a different constraint, and the 2027 production data will reveal which constraint matters most for agentic workloads at scale.
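Unpacking the per-accelerator numbers behind these platform claims (our arithmetic, using only the figures cited above) helps, with the caveat that the precisions differ, so the values are not directly comparable across vendors.

```python
# Per-accelerator throughput implied by each vendor's stated figures. Note the
# precisions differ (NVFP4 for Rubin, FP8 for Trainium3), so this arithmetic
# only unpacks each claim rather than ranking the chips.

rubin_rack_eflops, rubin_gpus = 3.6, 72     # Vera Rubin NVL72, NVFP4 inference
rubin_per_gpu_pf = rubin_rack_eflops * 1e18 / rubin_gpus / 1e15

trainium3_pf = 2.52                          # per chip, FP8, as announced

print(f"Rubin per GPU:      {rubin_per_gpu_pf:.0f} NVFP4 PFLOPs")  # ~50
print(f"Trainium3 per chip: {trainium3_pf:.2f} FP8 PFLOPs")
```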

Beneath the SKU-strategy conversation sits a more technical differentiation: memory fabric architecture. Google and NVIDIA are both building on HBM4, which means the chip-level memory story is effectively at parity. The durable differentiation sits in how each architecture unifies memory across the accelerator fleet. NVIDIA's Vera Rubin NVL72 follows a scale-up philosophy engineered for maximum flexibility and ultra-low latency within a single rack environment. Beyond the rack, multiple NVL72 systems link via InfiniBand to achieve petabyte-class aggregate memory, but that data has to traverse conventional networking protocols, which introduces latency. NVIDIA's counter-move is the larger Vera Rubin POD, which can scale to 1,152 GPUs across roughly 40 racks and relies on Context Memory Storage to manage the massive KV caches required for trillion-parameter models.

The result is enormous aggregate capacity, but it remains a distributed system rather than a unified memory domain. Google's approach is architecturally different. The TPU 8t superpod uses the proven 3D torus Inter-Chip Interconnect to knit 9,600 chips into what Google describes as a single global address space, with 2 petabytes of HBM accessible as a unified pool rather than a federation of rack-level domains. Because the ICI integrates directly into the silicon, Google bypasses the performance cost of standard data center networking for the intra-superpod path. Our analyst estimate is that Google's unified memory pool within a TPU 8t superpod is roughly two orders of magnitude larger than what an NVL72 rack domain exposes as truly shared memory, which we believe is the single most important architectural claim embedded in today's announcement. Whether that matters for a given workload depends on how memory-cohesive the training job needs to be, but for frontier pre-training where gradients and activations must flow across the fleet with minimal latency, the gap is real and quantifiable.
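Our "two orders of magnitude" estimate can be reproduced with simple arithmetic under one assumption: we take 288 GB of HBM per Rubin GPU as the rack-domain building block for illustration (that per-GPU figure is not in Google's announcement), while the 2 PB pool comes from the TPU 8t claim.

```python
# Reproducing the "roughly two orders of magnitude" memory-domain estimate.
# The 288 GB-per-GPU figure for Rubin is our illustrative assumption; only
# the 2 PB TPU 8t pool is taken from the announcement.

import math

tpu8t_pool_bytes = 2e15              # 2 PB unified HBM per TPU 8t superpod
nvl72_shared_bytes = 72 * 288e9      # 72 GPUs x 288 GB HBM (assumed)

ratio = tpu8t_pool_bytes / nvl72_shared_bytes
print(f"NVL72 rack-domain HBM: {nvl72_shared_bytes / 1e12:.1f} TB")   # ~20.7 TB
print(f"Superpod / rack ratio: {ratio:.0f}x "
      f"(~{math.log10(ratio):.1f} orders of magnitude)")              # ~96x, ~2.0
```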

This memory-fabric divergence also clarifies why Google and NVIDIA are frequent co-residents inside Google Cloud Platform despite competing fiercely at the silicon level. The A5X bare-metal instance brings Vera Rubin NVL72 into GCP as a first-wave deployment, and Google and NVIDIA are co-engineering the open-source Falcon networking protocol via the Open Compute Project, with A5X implementing several concepts that originated in Falcon. Customers generally want both architectures under the same roof for different workload profiles, as well as supply chain optionality, and Google's decision to host its most direct competitor is a deliberate customer-choice strategy rather than a concession.

The more consequential subtext is that electricity, not wafer supply, remains the binding constraint on AI infrastructure. Google is designing to that structural reality: delivering roughly 6x more compute per unit of electricity versus five years ago, targeting up to 2x performance-per-watt over Ironwood, deploying fourth-generation liquid cooling, and integrating network-on-chip designs that reduce the energy cost of moving data across a pod. For context, a single NVIDIA Vera Rubin NVL72 rack is projected to draw roughly 120 kW, about the equivalent of 40 average American homes, which reframes what "winning" means at hyperscale. Multiple analysts are flagging power availability and grid interconnection timelines as the dominant gating factor on new data center capacity through 2028. Our analysis is that the eighth-generation TPU's most durable competitive advantage may end up being system-level energy efficiency rather than peak FLOPs, because the chip that serves the most useful tokens per megawatt is the chip that gets deployed at scale.

At the orchestration layer, Google's GKE Inference Gateway introduces a "predictive latency boost" that uses machine learning-driven, capacity-aware routing in place of heuristic load balancing, with the company claiming more than 70% reduction in time-to-first-token latency without manual tuning. Paired with llm-d (recently accepted as a CNCF Sandbox project with Google as a founding contributor alongside Red Hat, IBM Research, CoreWeave, and NVIDIA), this positions Google to own the serving-layer experience for agentic workloads regardless of which accelerator sits underneath. McKinsey's recent analysis, alongside others, predicts that inference spend will outpace training spend in enterprise budgets through 2027.

That rationale bolsters Google's decision to design TPU 8i with distinct serving-first tradeoffs, instead of treating it as a SKU-variant training derivative as in generations 4 and 5. The validation proof point from October 2025 is Anthropic, which announced access to up to one million TPUs in a deal worth tens of billions of dollars. Anthropic is now expanding on that commitment in April 2026 via a Google and Broadcom agreement for multiple gigawatts of next-generation TPU capacity beginning in 2027. This massive customer signal, combined with Google's decision to expose bare-metal TPU access and native PyTorch support, suggests Google is working to reduce historical friction points (framework lock-in, virtualization overhead) that have kept some workloads on NVIDIA by default.

Looking Ahead

Based on what we are observing, the real battleground for 2026 and 2027 will not be peak FLOPs per chip but rather cluster-level goodput and cross-site scalability. Google's Virgo Network claim of connecting 134,000 TPU 8t chips within a single data center fabric, and more than one million chips across multiple sites into a single training cluster, is the most ambitious architectural assertion in the announcement and the hardest one to independently verify until workloads are running at scale. If the Virgo fabric delivers on the 47 petabits-per-second bisection bandwidth and near-linear scaling properties Google claims, it meaningfully changes how AI labs think about geographic distribution of training. The second theme we will track is whether TorchTPU maturity actually translates into PyTorch-first shops migrating workloads, or whether the JAX and Pathways stack remains the path of least resistance for the Gemini-aligned ecosystem. Expect the Broadcom partnership alongside Google's supplier diversification strategies to drive the next TPU generation conversation by 2027.
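To give the Virgo figures some scale, a rough per-chip reading is possible under one assumption of ours: that the 47 Pb/s figure is a classic bisection cut, with worst-case cross-fabric traffic sourced by the 67,000 chips on one side of that cut. Neither the traffic model nor the cut definition is specified in the announcement.

```python
# What the Virgo bisection-bandwidth claim implies per chip, assuming (our
# assumption) 47 Pb/s is a classic bisection cut and worst-case traffic is
# sourced by the half of the fabric on one side of the cut.

fabric_bisection_bps = 47e15     # 47 Pb/s (claimed)
total_chips = 134_000            # TPU 8t chips per fabric (claimed)

per_chip_gbps = fabric_bisection_bps / (total_chips / 2) / 1e9
print(f"Worst-case cross-bisection share per chip: {per_chip_gbps:.0f} Gb/s")  # ~701
```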

 

Author Information

Ron Westfall | VP and Practice Leader for Infrastructure and Networking

Ron Westfall is a prominent analyst figure in technology and business transformation. Recognized as a Top 20 Analyst by AR Insights and a Tech Target contributor, his insights are featured in major media such as CNBC, Schwab Network, and NMG Media.

His expertise covers transformative fields such as Hybrid Cloud, AI Networking, Security Infrastructure, Edge Cloud Computing, Wireline/Wireless Connectivity, and 5G-IoT. Ron bridges the gap between C-suite strategic goals and the practical needs of end users and partners, driving technology ROI for leading organizations.

Stephen Sopko | Analyst-in-Residence – Semiconductors & Deep Tech

Stephen Sopko is an Analyst-in-Residence specializing in semiconductors and the deep technologies powering today’s innovation ecosystem. With decades of executive experience spanning Fortune 100, government, and startups, he provides actionable insights by connecting market trends and cutting-edge technologies to business outcomes.

Stephen’s expertise in analyzing the entire buyer’s journey, from technology acquisition to implementation, was refined during his tenure as co-founder and COO of Palisade Compliance, where he helped Fortune 500 clients optimize technology investments. His ability to identify opportunities at the intersection of semiconductors, emerging technologies, and enterprise needs makes him a sought-after advisor to stakeholders navigating complex decisions.