Hot Chips 2025: NVIDIA Ushers in the Scale-Across Era of Giga-Scale AI Super-Factories
NVIDIA's new Spectrum-XGS Ethernet scale-across technology is designed to connect multiple data centers into a single AI super-factory.
Key Highlights
- Spectrum-XGS Ethernet nearly doubles the performance of critical multi-GPU communication, enabling predictable behavior across a vast network.
- This technology uses intelligent algorithms to manage latency and congestion over long distances.
- AI demand is pushing single data centers to their limits of power and capacity.
- CoreWeave is an early adopter, providing a real-world proof point for the technology's ability to overcome physical data center limitations.
- The system supports demanding use cases like training massive AI models and running complex digital twin simulations.
The News
NVIDIA announced NVIDIA Spectrum-XGS Ethernet, a scale-across technology for combining distributed data centers into unified, giga-scale AI super-factories. For more information, read the NVIDIA press release.
Analyst Take
As the demand for AI grows, individual data centers are hitting their limits in terms of power and capacity. To keep up, data centers need to expand beyond a single location, but traditional Ethernet networks can struggle with this because of high latency and inconsistent performance.
NVIDIA unveiled NVIDIA Spectrum-XGS Ethernet, a new technology designed to overcome these challenges. It is an addition to the NVIDIA Spectrum-X Ethernet platform that allows multiple data centers to be combined into a single, massive AI super-factory. Think of it as a third way to scale AI computing, beyond just scaling up within a single server or scaling out within a single data center. Spectrum-XGS makes it possible to connect distributed data centers, creating a unified, giga-scale system with predictable performance.
Designed to connect distant data centers, Spectrum-XGS Ethernet uses advanced algorithms that automatically adjust the network to account for the distance between sites. By managing congestion and latency and providing full network telemetry, it nearly doubles the performance of the NVIDIA Collective Communications Library (NCCL). This means communication among multiple GPUs and servers is faster and more reliable, allowing multiple data centers to function as one cohesive AI super-factory.
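To make the NCCL claim concrete, below is a minimal sketch of the kind of multi-node collective whose cross-site performance Spectrum-XGS targets, written against PyTorch's nccl backend. The torchrun invocation, hostname, node count, and tensor size are illustrative assumptions, not NVIDIA's reference configuration.

```python
# Minimal multi-node all-reduce over NCCL via PyTorch's distributed package.
# Launch one copy per GPU on every node, e.g. (hostname/port illustrative):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=site-a-host:29500 allreduce_demo.py
import os

import torch
import torch.distributed as dist


def main():
    # NCCL is the collective library whose performance Spectrum-XGS
    # claims to nearly double across data centers.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Each rank contributes a gradient-sized tensor; all_reduce sums it
    # across every GPU on every participating node.
    grad = torch.ones(1024 * 1024, device="cuda") * dist.get_rank()
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print(f"all_reduce completed across {dist.get_world_size()} GPUs")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In a cross-site training job, this same all_reduce traverses the inter-data-center links, which is exactly where long-haul latency and jitter erode throughput and where Spectrum-XGS's congestion management is aimed.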
The NVIDIA Spectrum-X Ethernet platform provides 1.6x greater bandwidth than off-the-shelf Ethernet, which I expect will make it a top choice for massive, multi-tenant AI supercomputers. Featuring NVIDIA Spectrum-X switches and ConnectX-8 SuperNICs, the platform delivers the low latency and high performance essential for building scalable AI.
Giga-Scale AI Super-Factories Rising to Meet Demanding AI Use Cases
Unified, giga-scale AI super-factories are critical for supporting the most demanding and data-intensive AI workloads. A primary use case is the training of massive foundational models, which are too large to be handled by a single data center. By linking geographically distributed facilities, these super-factories enable the simultaneous training of models with trillions of parameters, drastically reducing the time required to develop breakthroughs in large language models, multimodal AI, and scientific discovery. Furthermore, I find that this unified approach is essential for running complex AI simulations, such as creating digital twins of entire cities or industrial supply chains, which require immense computational power to model and analyze real-world systems with high fidelity.
Beyond training, these super-factories are also vital for large-scale AI inference and deployment. This includes supporting massive AI services for a global user base, such as real-time conversational AI assistants, advanced image and video analysis for security and media, and personalized recommendations for e-commerce on a global scale.
By pooling resources across multiple locations, companies can provide low-latency AI services to users anywhere in the world, ensuring a seamless and responsive experience. The ability to operate as a single, cohesive entity also allows for more efficient resource allocation and load balancing, ensuring that computational resources are used optimally to meet fluctuating demand and drive innovation across every industry.
CoreWeave, which accounts for roughly 91% of NVIDIA's stock portfolio with a stake valued at approximately $3 billion, is deploying NVIDIA Spectrum-XGS Ethernet, providing nascent validation of NVIDIA's new scale-across technology. Because CoreWeave is a major provider of AI cloud infrastructure, its adoption signals to the broader market that Spectrum-XGS can prove a viable solution for overcoming the physical limitations of single data centers, such as constraints on power, space, and cooling. This partnership is set to provide a crucial, real-world proof point for NVIDIA's claim that its technology can maintain high-performance, low-latency communication across long distances, which would be essential for training and running giga-scale AI models.
NVIDIA Pacesetting the Competitive Landscape
From my perspective, NVIDIA's competitors in the data center networking space are diverse and include both traditional networking companies and other chip manufacturers. NVIDIA Spectrum-XGS Ethernet's competitive advantage centers primarily on its breakthrough scale-across technology, which is specifically optimized to turn multiple, geographically dispersed data centers into a single, cohesive AI super-factory.
Chip and hardware manufacturers AMD and Intel are key competitors, as they also offer server components and are developing their own networking solutions for AI. Historically, InfiniBand, a high-performance interconnect standard, has been a key competitor to Ethernet in HPC and AI, and NVIDIA itself has been a leader in this market. However, with Spectrum-XGS, NVIDIA is actively pushing an Ethernet-based solution to challenge InfiniBand's dominance in AI workloads.
Key data center networking competitors such as Cisco, HPE Juniper, Extreme Networks, and Arista offer high-performance Ethernet switches and network infrastructure. They have strong market positions and are working on their own AI-optimized solutions. NVIDIA's advantage could come from its tight integration of networking with its firmly established AI hardware and software stack. Moreover, hyperscalers Amazon (AWS), Google (Google Cloud), and Microsoft (Azure) are also competitors, as they are developing their own custom networking hardware and software to optimize their internal AI infrastructure and services.
As a result, I anticipate that the hyperscalers will increasingly look to reduce their reliance on NVIDIA for two primary reasons: cost and control. NVIDIA's dominant market position in AI GPUs, especially with its CUDA software ecosystem, enables the company to command a premium price for its chips. As these cloud providers build out massive AI infrastructure to serve both their own needs and external customers, the cost of acquiring and scaling NVIDIA's hardware can become a significant financial burden.
By designing their own custom AI chips, such as Google's TPUs, Amazon's Trainium and Inferentia, and Microsoft's Maia, hyperscalers can optimize performance and power efficiency for their specific workloads, which can lead to substantial cost savings over time. This strategic shift is a form of vertical integration, allowing them to better control their supply chain and avoid being dependent on a single, expensive vendor.
Furthermore, the desire for greater control over their technology stack is a major driver behind this trend. Relying solely on a single supplier like NVIDIA creates supply chain risk and can limit innovation. By developing their own silicon, hyperscalers can tailor the hardware and software to their unique cloud architecture and a wide range of proprietary AI applications. This not only allows for better performance and efficiency but also enables them to differentiate their cloud offerings and provide specialized services to customers.
While NVIDIA's CUDA ecosystem remains a powerful moat, these companies have the resources and engineering talent to build their own software layers and attract developers, gradually chipping away at NVIDIA's software advantage, especially for inference workloads, which tend to be more predictable and less computationally intensive than training.
In its positioning, I expect NVIDIA to emphasize that traditional Ethernet struggles with the high latency and jitter that come with connecting data centers over long distances. Spectrum-XGS counters this with advanced algorithms for auto-adjusted distance congestion control, precision latency management, and end-to-end telemetry. NVIDIA claims this nearly doubles the performance of its NCCL, a critical component for multi-GPU and multi-node communication, making it uniquely suited for giga-scale AI workloads.
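NVIDIA has not published the internals of these algorithms, so the following is only a back-of-the-envelope illustration of why distance-aware congestion control matters, using the standard bandwidth-delay product; the 800 Gb/s line rate and round-trip times are assumed figures for illustration, not Spectrum-XGS specifications.

```python
# Illustrative only: not NVIDIA's algorithm. The bandwidth-delay product
# (BDP) is the amount of data that must be in flight to keep a link full;
# it grows linearly with round-trip time, so a congestion window tuned
# for intra-data-center RTTs starves a cross-metro or regional link.

def bdp_bytes(bandwidth_gbps: float, rtt_ms: float) -> float:
    """Bytes in flight needed to saturate the link (bandwidth x delay)."""
    return (bandwidth_gbps * 1e9 / 8) * (rtt_ms / 1e3)

# Assumed RTTs for three distance regimes at an assumed 800 Gb/s line rate.
for label, rtt_ms in [("intra-DC", 0.01), ("metro ~50 km", 0.5), ("regional ~500 km", 5.0)]:
    print(f"{label:>17}: {bdp_bytes(800, rtt_ms) / 1e6:8.1f} MB in flight")
```

The roughly three-orders-of-magnitude spread in required in-flight data is the core problem any scale-across fabric must solve: window sizes, pacing, and retransmission behavior all have to adapt to the measured distance rather than assume intra-rack latencies.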
NVIDIA and the Ultra Ethernet Consortium
From my perspective, NVIDIA should have addressed its membership in the Ultra Ethernet Consortium (UEC) in the Spectrum-XGS Ethernet announcement. The launch is an opportunity to influence the future of a critical technology that directly competes with its own InfiniBand and Spectrum-X offerings. However, InfiniBand and Ethernet can and do coexist in the same data center. This is a very common practice in high-performance computing (HPC) and AI/ML environments, where network architects strategically use each technology for the workloads it handles best.
Through its membership, NVIDIA can ensure that the new Ultra Ethernet standard is interoperable with its hardware and software to protect its dominant position in the AI networking market. The UEC's goal is to create an open, multi-vendor Ethernet solution that provides the low latency and high performance needed for large-scale AI and HPC workloads, which can have direct bearing on NVIDIA’s ability to influence blended InfiniBand/Ethernet environments as well as unfolding InfiniBand-to-Ethernet transitions.
Furthermore, membership in the UEC enables NVIDIA to help steer an emerging standard that is poised to shape much of the future of AI networking. The consortium aims to improve Ethernet's capabilities for high-performance computing, particularly by developing new protocols for Remote Direct Memory Access (RDMA), congestion control, and network telemetry. These are all areas where NVIDIA's InfiniBand and Spectrum-X platforms currently excel. By actively participating, NVIDIA can contribute its extensive expertise in these domains, ensuring that the new standards benefit from its innovations while also making its own products more compliant with the future direction of the industry. This is a proactive strategy to maintain relevance and leadership in a rapidly evolving market.
Looking Ahead
The AI industrial revolution is underway, and I anticipate that large-scale AI factories are the new essential infrastructure. With Spectrum-XGS Ethernet, NVIDIA is adding a new scale-across dimension to AI scaling. This technology links data centers across cities, nations, and continents, creating a single, massive AI super-factory.
To strengthen the Spectrum-XGS Ethernet announcement over the next 12 months, NVIDIA should focus on providing tangible proof points and a clear roadmap for customer adoption. While the initial announcement highlights its technical capabilities like scale-across and performance doubling, it needs to move from a conceptual breakthrough to a demonstrated solution. NVIDIA can achieve this by showcasing successful, real-world deployments with major hyperscalers beyond the initial CoreWeave partnership, providing detailed case studies with quantifiable results in areas like model training time and inference latency.
The company should also publish a more granular product roadmap, detailing when specific features, such as support for even longer distances or new software integrations, will become available, as well as the role of the UEC. This can build confidence among potential customers, particularly those who may be hesitant due to the competitive, rapidly evolving nature of AI networking. By validating its claims with data and demonstrating a clear path forward, NVIDIA can solidify its competitive advantage against rivals and encourage widespread adoption.
Ron Westfall | Analyst In Residence
Ron Westfall is a prominent analyst in technology and business transformation. Recognized as a Top 20 Analyst by AR Insights and a TechTarget contributor, he is featured in major media such as CNBC, Schwab Network, and NMG Media.
His expertise covers transformative fields such as Hybrid Cloud, AI Networking, Security Infrastructure, Edge Cloud Computing, Wireline/Wireless Connectivity, and 5G-IoT. Ron bridges the gap between C-suite strategic goals and the practical needs of end users and partners, driving technology ROI for leading organizations.