Research Notes

Is AMD’s Ecosystem Play Bigger Than Its Benchmark Win?

Nine partners, three GPU generations, two geographies: AMD's reproducibility and heterogeneous orchestration story goes beyond raw benchmarks

04/06/2026

Key Highlights

  • AMD Instinct MI355X GPUs surpassed 1 million tokens per second in MLPerf Inference 6.0, crossing a production-scale threshold on both Llama 2 70B and GPT-OSS-120B workloads at multinode cluster scale.
  • The MI355X platform is designed to deliver approximately 3.1x more throughput than the previous-generation MI325X on Llama 2 70B Server, reflecting CDNA 4 architecture and native FP4 precision gains.
  • AMD introduced first-time MLPerf results on the GPT-OSS-120B and Wan-2.2-t2v text-to-video benchmarks, extending model coverage beyond large language models.
  • Nine ecosystem partners, including Dell, Oracle, HPE, and Cisco, reproduced AMD's MI355X results within 4% of AMD's own submission, signaling commercial reproducibility beyond controlled lab conditions.
  • AMD's absence from the DeepSeek-R1 and multimodal Qwen3-VL benchmarks leaves a measurable gap in workload coverage that enterprise buyers should weigh against the strong headline numbers.

The News

AMD announced its MLPerf Inference 6.0 results, demonstrating the AMD Instinct MI355X GPU's performance across LLMs, a first-time text-to-video benchmark, and multinode cluster deployments. The MI355X, built on the AMD CDNA 4 architecture on a 3nm process with up to 288GB of HBM3E memory, is intended to deliver competitive per-node throughput and efficient scale-out across cluster configurations. The company claims multinode scale-out efficiency of 93% to 98% across workloads and reports surpassing 1 million tokens per second on Llama 2 70B and GPT-OSS-120B at multinode scale. AMD also introduced a first-ever three-GPU heterogeneous submission spanning MI300X, MI325X, and MI355X systems across the United States and Korea. Full technical details are available at AMD's official blog.

Analyst Take

We have watched AMD operate as the perennial challenger in accelerated compute for years, and MLPerf Inference 6.0 feels like a genuine shift. Not because AMD has dethroned NVIDIA in inference (it has not), but because the submission shows AMD converging on the right set of problems to solve at the right moment. The 1-million-tokens-per-second threshold is a compelling headline, but our primary read is not the performance figure itself. The bigger win is that nine ecosystem partners reproduced AMD's benchmark numbers within 4%. Anyone who has managed GPU procurement at enterprise scale knows that lab results rarely survive contact with partner hardware, firmware variability, and real supply chains. When Dell, Oracle, HPE, and Cisco land within 1 to 4 percent of AMD's own submission, that is more than marketing copy. It is an operationally meaningful signal, and it is the kind of data that moves procurement conversations.

What Was Announced

AMD's MLPerf Inference 6.0 submission is built around the AMD Instinct MI355X GPU, the current flagship of the MI350 Series. The MI355X is built on a 3nm process node and designed to deliver up to 10 petaflops of FP4 and FP6 inference performance. The GPU offers up to 288GB of HBM3E memory, aimed at enabling single-GPU deployment of models up to 520 billion parameters. That memory capacity is crucial as model parameter counts and context windows grow.
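
The 520-billion-parameter claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below is our own illustration, not AMD's published methodology: at FP4 precision, weights occupy roughly half a byte per parameter, which puts a 520B-parameter model at about 260GB and leaves roughly 28GB of the 288GB capacity for KV cache, activations, and runtime overhead.

```python
# Back-of-envelope check on the 288GB / 520B-parameter claim, assuming
# FP4 weights at 0.5 bytes per parameter. Ignores KV cache, activations,
# and runtime overhead, which consume the remaining headroom in practice.

BYTES_PER_FP4_PARAM = 0.5
HBM_CAPACITY_GB = 288

def weight_footprint_gb(params_billions: float) -> float:
    """Weight memory in GB: billions of params x 0.5 bytes/param = GB."""
    return params_billions * BYTES_PER_FP4_PARAM

if __name__ == "__main__":
    weights = weight_footprint_gb(520)  # 260 GB
    print(f"FP4 weights: {weights:.0f} GB; "
          f"headroom: {HBM_CAPACITY_GB - weights:.0f} GB")
```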

The submission spans five areas of interest. On Llama 2 70B, AMD showed near-parity with the NVIDIA B200: matching Offline throughput, reaching 97% of Server throughput, and exceeding the B200 in Interactive at 119%. Against the newer NVIDIA B300 single-node, the MI355X platform achieved 93% in Server and 104% in Interactive. On the newly introduced GPT-OSS-120B benchmark, AMD submitted first-time results reaching 111% of NVIDIA B200 Offline and 115% of NVIDIA B200 Server at single-node scale.

The multinode scaling results are worth close attention. At 11 nodes and 87 MI355X GPUs, AMD achieved approximately 3.1x more aggregate throughput than the previous-generation MI325X on Llama 2 70B while sustaining 93% scale-out efficiency in both Offline and Server modes. A first-time Wan-2.2-t2v text-to-video submission rounded out the coverage, achieving 93% of NVIDIA B200 single-node performance in the official submission.
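
For readers less familiar with the metric, scale-out efficiency compares achieved aggregate throughput against ideal linear scaling from a single node. A minimal sketch of the calculation follows, using hypothetical per-node numbers rather than AMD's published figures:

```python
# Minimal sketch of MLPerf-style scale-out efficiency: aggregate multinode
# throughput divided by (single-node throughput x node count). The numbers
# below are hypothetical placeholders, not AMD's measured per-node values.

def scale_out_efficiency(aggregate_tps: float,
                         single_node_tps: float,
                         num_nodes: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return aggregate_tps / (single_node_tps * num_nodes)

if __name__ == "__main__":
    # Hypothetical: an 11-node cluster whose single node sustains 100,000 tok/s.
    eff = scale_out_efficiency(aggregate_tps=1_023_000,
                               single_node_tps=100_000,
                               num_nodes=11)
    print(f"scale-out efficiency: {eff:.1%}")  # -> 93.0%
```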

Market Analysis

The broader inference market context reinforces why this submission matters. According to Deloitte, inference is expected to account for two-thirds of all AI compute by 2026. The majority of that inference will remain on data center hardware rather than edge silicon. That structural dynamic puts sustained pressure on enterprises to select inference platforms that can scale predictably and operate across heterogeneous hardware generations, not just deliver peak throughput on homogeneous configurations. Deloitte's 2026 Tech Trends analysis further identifies a 280-fold reduction in per-token inference costs over two years, yet enterprise monthly AI bills are still reaching tens of millions of dollars, because usage growth has dramatically outpaced cost reduction. That paradox creates a clear opening for inference platforms that can improve cost-per-token economics at cluster scale, and it is one of the stronger strategic arguments AMD can make for the MI355X alongside its benchmark performance. Enterprises running always-on agentic AI workloads face continuous inference demands that make per-token efficiency as important as peak throughput, and an alternative platform that can demonstrably lower inference economics at scale carries real procurement weight when monthly bills are already at that magnitude.
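
The paradox is easy to see in miniature. In the sketch below, the 280x cost reduction is Deloitte's figure, while the usage-growth multiple is a hypothetical assumption chosen only to illustrate how total spend can rise even as per-token prices collapse:

```python
# Illustrative arithmetic for the inference cost paradox. The 280x per-token
# cost reduction is Deloitte's published figure; the 500x usage-growth
# multiple is a hypothetical assumption, not a sourced statistic.

COST_REDUCTION = 280  # per-token inference cost fell ~280x over two years
USAGE_GROWTH = 500    # hypothetical: token volume grew 500x in the same window

bill_multiplier = USAGE_GROWTH / COST_REDUCTION
print(f"monthly bill multiplier: {bill_multiplier:.2f}x")  # -> 1.79x
# Spend still grows because usage growth outpaces the cost decline.
```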

AMD's heterogeneous three-GPU submission, combining MI300X, MI325X, and MI355X across geographically distributed systems in the United States and Korea, is designed to address exactly that buyer concern. The nine-partner ecosystem submission, spanning Cisco, Dell, Giga Computing, HPE, MangoBoost, MiTAC, Oracle, Red Hat, and Supermicro, reinforces commercial breadth. ROCm's ability to orchestrate across GPU generations and geographies moves from product brief to verifiable claim, but it leaves the deeper ecosystem question unsettled. Developer toolchain depth continues to favor NVIDIA CUDA substantially in the breadth of optimized libraries, debugging tools, and community-contributed model integrations. For enterprises building net-new AI infrastructure stacks, that toolchain gap means longer ramp-up timelines and potentially higher software integration costs. AMD is making visible progress, but ROCm parity with CUDA is a multi-year effort, not a current-round accomplishment.
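
To make the heterogeneous-orchestration value proposition concrete, the sketch below shows one generic way a serving layer could split traffic across mixed-generation nodes in proportion to measured throughput. This is an illustration of the scheduling concept only, not a ROCm API; the MI300X figure is a hypothetical placeholder, while the 3.1x MI355X-to-MI325X ratio comes from AMD's submission.

```python
# Generic capacity-weighted request splitting across a mixed-generation GPU
# fleet. Illustrative only: this is not a ROCm interface, and the node names
# and MI300X throughput figure are hypothetical placeholders.

def weighted_shares(node_throughput: dict[str, float]) -> dict[str, float]:
    """Split incoming traffic in proportion to each node's tokens/sec."""
    total = sum(node_throughput.values())
    return {node: tps / total for node, tps in node_throughput.items()}

if __name__ == "__main__":
    # Relative throughputs: MI325X normalized to 1.0; the 3.1x MI355X ratio
    # is from AMD's submission, and the MI300X value is assumed.
    fleet = {"mi300x-us": 0.8, "mi325x-us": 1.0, "mi355x-kr": 3.1}
    for node, share in weighted_shares(fleet).items():
        print(f"{node}: {share:.1%} of requests")
```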

Where we would caution enterprise buyers most directly is on benchmark scope. AMD did not submit results for DeepSeek-R1 or the multimodal Qwen3-VL benchmark in this round, two workloads where NVIDIA demonstrated significant performance leadership using 288-GPU GB300 NVL72 configurations. DeepSeek-R1 is particularly consequential: it has become the benchmark most closely watched by enterprise AI infrastructure teams evaluating real-world reasoning workload costs and throughput. AMD's absence leaves a gap in exactly the performance dimension where CIO conversations are currently focused, and it gives NVIDIA a largely uncontested data point in the workloads driving the most new infrastructure procurement decisions. Until AMD submits competitive DeepSeek-R1 results, the headline performance story from MLPerf 6.0 will remain contextualized by that omission in enterprise evaluations.

NVIDIA also showed that the same GB300 NVL72 hardware improved DeepSeek-R1 throughput 2.77x between MLPerf rounds through software optimization alone, with similar software-driven gains across other workloads. That trajectory reinforces a structural dynamic in which inference leadership is increasingly determined by software and systems integration depth rather than silicon generation alone. For AMD, this is both an opportunity and a warning. The competitive bar no longer depends solely on successive hardware releases, yet NVIDIA's ability to compound performance on installed hardware through TensorRT-LLM, Dynamo, and kernel-level optimizations effectively shortens the window between AMD's hardware advances and NVIDIA's software responses. Those dynamics will tighten further as both companies move toward rack-scale architectures later in 2026.

Looking Ahead

Our analysis of the market suggests that AMD's MI355X results in MLPerf Inference 6.0 represent a credible but incomplete claim on the inference infrastructure market. The trajectory we will be monitoring most closely is whether AMD can close the workload coverage gap in future MLPerf rounds, specifically on reasoning models like DeepSeek-R1 and emerging multimodal architectures, before NVIDIA widens that coverage advantage further through its NIM microservice stack and Dynamo serving framework. The heterogeneous three-GPU submission deserves continued scrutiny. If AMD ROCm can demonstrably orchestrate mixed-generation GPU clusters across geographic boundaries with predictable efficiency at enterprise scale, it opens a genuinely differentiated value proposition for organizations managing multi-vintage infrastructure on extended refresh cycles. That is not a benchmark story. That is a total cost of ownership story, and it may resonate with procurement teams who already hold MI300X deployments and need a clear, low-disruption upgrade path. With the MI400 Series on CDNA 5 architecture and the Helios rack-scale solution both on AMD's stated 2026 roadmap, the next MLPerf cycle will be a sharper test of whether AMD's annual cadence is compounding into lasting competitive advantage or remaining an ambitious catch-up effort.

Author Information

Stephen Sopko | Analyst-in-Residence – Semiconductors & Deep Tech

Stephen Sopko is an Analyst-in-Residence specializing in semiconductors and the deep technologies powering today’s innovation ecosystem. With decades of executive experience spanning Fortune 100, government, and startups, he provides actionable insights by connecting market trends and cutting-edge technologies to business outcomes.

Stephen’s expertise in analyzing the entire buyer’s journey, from technology acquisition to implementation, was refined during his tenure as co-founder and COO of Palisade Compliance, where he helped Fortune 500 clients optimize technology investments. His ability to identify opportunities at the intersection of semiconductors, emerging technologies, and enterprise needs makes him a sought-after advisor to stakeholders navigating complex decisions.

Author Information

Ron Westfall | VP and Practice Leader for Infrastructure and Networking

Ron Westfall is a prominent analyst in technology and business transformation. He is recognized as a Top 20 Analyst by AR Insights and a TechTarget contributor, and his insights are featured in major media outlets such as CNBC, Schwab Network, and NMG Media.

His expertise covers transformative fields such as Hybrid Cloud, AI Networking, Security Infrastructure, Edge Cloud Computing, Wireline/Wireless Connectivity, and 5G-IoT. Ron bridges the gap between C-suite strategic goals and the practical needs of end users and partners, driving technology ROI for leading organizations.