Amazon’s Project Rainier Sets New Standard for AI Supercomputing at Scale

AWS’s Project Rainier is building one of the world’s largest AI supercomputing clusters, powered by custom Trainium2 UltraServers. More than just raw compute, Rainier reflects Amazon’s push to vertically integrate its AI infrastructure, redefine data center scale, and challenge Nvidia’s dominance in the hyperscale AI arms race.
July 8, 2025
8 min read

Announced at the end of 2024, Project Rainier is now well on its way to completion; according to a recent blog post, Amazon says it will be the world’s most powerful computer for training artificial intelligence models.

Amazon’s Project Rainier is a super-sized AI computing initiative from Amazon Web Services (AWS) aimed at building one of the world’s largest AI supercomputing clusters. The primary client is Anthropic, creator of the Claude AI models, which is using Rainier to train its next-generation LLMs; Amazon completed a $4 billion investment in the company in March 2024.

The project centers on a massive EC2 UltraCluster composed of Trainium2 UltraServers, powered by hundreds of thousands of AWS-designed Trainium2 AI chips. These second-generation chips are custom silicon from Annapurna Labs, purpose-built for training large-scale AI models. Each UltraServer houses 64 Trainium2 chips and delivers approximately 332 petaflops of sparse FP8 compute performance.
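For a rough sense of per-chip scale, the quoted figures support some back-of-envelope arithmetic (a sketch using only the numbers above, not official per-chip specifications):

```python
# Back-of-envelope math using only the figures quoted above.
CHIPS_PER_ULTRASERVER = 64
ULTRASERVER_PFLOPS = 332  # sparse FP8, per AWS

per_chip_pflops = ULTRASERVER_PFLOPS / CHIPS_PER_ULTRASERVER
print(f"~{per_chip_pflops:.1f} PFLOPS of sparse FP8 per Trainium2 chip")  # ~5.2
```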

More than raw compute, though, what truly distinguishes the effort is its distributed architecture. Rather than being concentrated in a single location, the cluster spans multiple AWS data centers, enabling power and thermal optimization while remaining tightly integrated through AWS’s Elastic Fabric Adapter (EFA), an ultra-low-latency interconnect. This architecture allows geographically dispersed infrastructure to operate as a unified training system. According to AWS, the new cluster will deliver up to five times the computing power Anthropic previously used to train its Claude models.
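At the instance level, that EFA fabric is exposed through ordinary EC2 APIs. As a minimal illustration, here is a sketch of requesting an EFA-enabled Trainium2 instance inside a cluster placement group (the trn2.48xlarge instance type reflects AWS’s public Trn2 offering; all resource IDs are placeholders):

```python
# Sketch: launch an EFA-enabled Trn2 instance in a cluster placement group.
# All IDs below are placeholders, not real resources.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # placeholder: a Neuron-enabled AMI
    InstanceType="trn2.48xlarge",          # 16 Trainium2 chips per instance
    MinCount=1,
    MaxCount=1,
    Placement={"GroupName": "demo-cluster-pg"},  # hypothetical placement group
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "SubnetId": "subnet-0123456789abcdef0",  # placeholder
        "Groups": ["sg-0123456789abcdef0"],      # placeholder
        "InterfaceType": "efa",                  # Elastic Fabric Adapter NIC
    }],
)
print(response["Instances"][0]["InstanceId"])
```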

Gadi Hutt, director of product and customer engineering at Annapurna Labs, the specialist chips arm of AWS, said of the cluster performance:

Rainier will provide five times more computing power compared to Anthropic’s current largest training cluster. For a frontier model like Claude, the more compute you put into training it, the smarter and more accurate it will be. We’re building computational power at a scale that’s never been seen before and we’re doing it with unprecedented speed and agility.

Project Rainier is part of a broader $100 billion AWS investment in AI infrastructure during 2025, an effort that places Amazon in direct competition with other hyperscale initiatives such as Microsoft and OpenAI’s Stargate. More critically, Rainier advances AWS’s long-term strategy of deep vertical integration, enabling the company to reduce dependence on Nvidia’s GPUs by scaling its own training hardware, lowering costs, and accelerating time-to-market for large language models.

What Is an UltraServer?

At the heart of Rainier are the Trainium2 UltraServers, high-performance AI training nodes engineered by AWS and built around its second-generation Trainium chips. Designed by Annapurna Labs, these custom ASICs are optimized for massive-scale model training and power the core of AWS EC2 Trn2 instances and UltraClusters.

The UltraServer architecture addresses one of the core bottlenecks in AI training: latency. Each server integrates 64 Trainium2 chips and leverages Amazon NeuronLink v2, the company’s proprietary chip-to-chip and server-to-server interconnect. Key upgrades in NeuronLink v2 include:

  • 2× bandwidth over the previous generation

  • Latency optimization tailored for AI training pipeline stages

  • Scalability to clusters of over 100,000 interconnected chips

AWS compares NeuronLink v2 to Nvidia’s NVLink but with tighter integration into the AWS software and infrastructure stack, enabling performance tuning across every layer of the system.
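On the software side, Trainium is typically programmed through the AWS Neuron SDK, whose PyTorch path (torch-neuronx) builds on PyTorch/XLA. Below is a minimal single-training-step sketch, assuming a Trn2 instance with the Neuron SDK installed; API details vary by SDK version, so treat it as illustrative rather than definitive.

```python
# Sketch: one training step on Trainium via the Neuron SDK's PyTorch/XLA path.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # ships alongside torch-neuronx installs

device = xm.xla_device()               # resolves to a NeuronCore on Trn2
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 1024, device=device)
loss = model(x).sum()
loss.backward()
optimizer.step()
xm.mark_step()                         # flush the lazily built XLA graph
print("single training step compiled and executed")
```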

Each UltraServer is a shared high-bandwidth compute platform engineered for enterprise-grade reliability, with liquid cooling that enables sustained delivery of up to 332 petaflops of sparse FP8 performance. The chassis also includes 8 TB of high-bandwidth memory (HBM) and dual-redundant power supplies.

By designing and manufacturing its own chips, servers, and supporting infrastructure, AWS gains end-to-end control of the AI stack, from the silicon level up through software orchestration, network topology, and even the physical layout and power architecture of the data centers that house it all.

The importance of this level of control can’t be overstated, as Annapurna director of engineering Rami Sinno points out:

When you know the full picture, from the chip all the way to the software, to the servers themselves, then you can make optimizations where it makes the most sense. Sometimes the best solution might be redesigning how power is delivered to the servers, or rewriting the software that coordinates everything. Or it might be doing all of this at once. Because we have an overview of everything, at every level, we can troubleshoot rapidly and innovate much, much faster.

Sustainability Through Efficiency

Even amid the buildout of one of the world’s most energy-intensive AI compute platforms, AWS remains on track to meet its net-zero carbon goal by 2040, a cornerstone of Amazon’s corporate climate strategy.

“Our data center engineering teams, from rack layouts to electrical distribution to cooling techniques, are constantly innovating to increase energy efficiency,” said Hutt. “Regardless of the scale AWS operates at, we always keep our sustainability goals front of mind.”

According to Amazon, the company matched 100% of its electricity consumption with renewable energy in 2023, reaching its goal of 100% renewable power seven years ahead of its 2030 target. That early milestone now sets the stage for scaling high-performance AI workloads without compromising on environmental commitments.

Project Rainier also strengthens AWS’s ability to compete directly with Nvidia, giving customers a high-performance alternative to GPU-based infrastructure. By vertically integrating everything from chip design to cooling systems, AWS can lower the cost of training cutting-edge models (especially those in the 10–100 billion parameter range and beyond) while offering differentiated training infrastructure optimized for scale.
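A common rule of thumb makes that parameter range concrete: mixed-precision training with the Adam optimizer is often budgeted at roughly 16 bytes of accelerator memory per parameter (16-bit weights and gradients plus 32-bit master weights and two optimizer moments), before counting activations. A quick sketch of what that implies:

```python
# Rule-of-thumb training memory, excluding activations: ~16 bytes/parameter
# (bf16 weights + bf16 grads + fp32 master weights + fp32 Adam m and v).
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

for billions in (10, 70, 100):
    terabytes = billions * 1e9 * BYTES_PER_PARAM / 1e12
    print(f"{billions:>4}B params -> ~{terabytes:.1f} TB of training state")
# 10B -> ~0.2 TB, 70B -> ~1.1 TB, 100B -> ~1.6 TB
```

At those sizes, training state alone quickly exceeds any single accelerator’s memory, which is why pooled HBM and fast interconnects matter as much as raw FLOPS.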

As with other AWS services, the goal is to provide customers flexibility and choice. Users can select the right compute engine (whether Trainium, GPUs, or other accelerators) based on performance and cost needs, all while staying within the broader AWS ecosystem.
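That choice can even be explored programmatically, for example by comparing a Trainium-based instance type against a GPU-based one through the standard EC2 API (a sketch; the instance type names are illustrative, and regional availability varies):

```python
# Sketch: compare accelerator-backed EC2 instance types side by side.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instance_types(
    InstanceTypes=["trn2.48xlarge", "p5.48xlarge"]  # Trainium2 vs. GPU-based
)
for it in resp["InstanceTypes"]:
    mem_gib = it["MemoryInfo"]["SizeInMiB"] // 1024
    vcpus = it["VCpuInfo"]["DefaultVCpus"]
    print(f'{it["InstanceType"]}: {vcpus} vCPUs, {mem_gib} GiB host memory')
```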

This balance of innovation, integration, and sustainability reflects a salient industry trend: delivering powerful AI infrastructure at scale without sacrificing environmental responsibility.

Supersized Infrastructure for the AI Era

As AWS deploys Project Rainier, it is scaling AI compute to unprecedented heights while laying down a decisive marker in the escalating arms race for hyperscale dominance. With custom Trainium2 silicon, proprietary interconnects, and vertically integrated data center architecture, Amazon joins the Microsoft-OpenAI Stargate effort and Google’s TPU v5 clusters in a trio of hyperscale initiatives rapidly redefining the future of AI infrastructure.

But Rainier represents more than just another high-performance cluster. It arrives in a moment where the size, speed, and ambition of AI infrastructure projects have entered uncharted territory. Consider the past several weeks alone:

  • On June 24, AWS detailed Project Rainier, calling it “a massive, one-of-its-kind machine” and noting that “the sheer size of the project is unlike anything AWS has ever attempted.” The New York Times reports that the primary Rainier campus in Indiana could include up to 30 data center buildings.

  • Just two days later, Fermi America unveiled plans for the HyperGrid AI campus in Amarillo, Texas, on a sprawling 5,769-acre site with the potential for 11 gigawatts of power and 18 million square feet of AI data center capacity.

  • And on July 1, Oracle projected $30 billion in annual revenue from a single OpenAI cloud deal, tied to the Project Stargate campus in Abilene, Texas.

As Data Center Frontier founder Rich Miller has observed, the dial on data center development has officially been turned to 11. Once an aspirational concept, the gigawatt-scale campus is now materializing, 15 months after Miller forecast its arrival. “It’s hard to imagine data center projects getting any bigger,” he notes. “But there’s probably someone out there wondering if they can adjust the dial so it goes to 12.”

Against this backdrop, Project Rainier represents not just financial investment but architectural intent. As with Microsoft’s AI supercomputer buildout in Iowa or Meta’s AI Research SuperCluster, AWS is redesigning everything, from chips and interconnects to cooling systems and electrical distribution, to optimize for large-scale AI training.

In this new era of AI factories, such vertically integrated campuses are not only engineering feats; they are strategic moats. By exerting control over the full stack, from silicon to software to power grid, AWS aims to offer cost, performance, and sustainability advantages at a time when those factors will increasingly separate winners from followers.

Ultimately, Project Rainier affirms a broader truth: the frontier of AI is no longer defined by algorithms alone, but by the infrastructure that enables them. And in today’s market, that infrastructure is being purpose-built at hyperscale.



About the Author

David Chernicoff

David Chernicoff is an experienced technologist and editorial content creator with a knack for seeing the connections between technology and business, getting the most from both, and explaining the needs of business to IT and IT to business.

Matt Vincent

A B2B technology journalist and editor with more than two decades of experience, Matt Vincent is Editor in Chief of Data Center Frontier.
