Sponsored

Why CIOs Need to Reassess Storage Architecture for AI Infrastructure

Ken Claffey, CEO of VDURA, explains why storage systems can and should be designed to deliver consistent availability and throughput in real-world conditions rather than just under ideal benchmarks.

Ken Claffey

June 3, 2026



Ken Claffey, CEO of VDURA (Source: VDURA)

The sheer scale of AI infrastructure is exposing limitations in architectures not designed for such enormous workloads. At the same time, demand for GPUs has surged, prompting cloud providers to shift from selling raw compute to delivering guaranteed outcomes as AI increasingly moves into production environments.

As a result, SLAs are now a commercial expectation rather than a point of differentiation. Providers are committing to high levels of rack-level uptime, raising expectations across the entire stack. Indeed, those that cannot meet agreed performance standards risk losing deals early in the buying process as organizations identify a growing gap between the promises being made and what the underlying infrastructure can consistently deliver.

A key part of this performance dynamic is the role played by storage. AI workloads depend not only on continuous, predictable compute capability but also on continuous data delivery. GPUs can only process data if it is supplied without interruption, so any delay or gap in storage access immediately halts useful work. From an infrastructure specification perspective, this means storage availability must not merely match compute availability but exceed it if overall system performance is to be maintained.

Consider this scenario: if an AI storage system runs at 98% availability but compute runs at 99.5%, the two figures multiply, and the combined service level drops to about 97.5%, below the performance customers have been promised. But that’s not the only problem. At scale, this issue can quickly translate into significant idle GPU capacity and the risk of associated SLA penalties.
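The arithmetic behind that scenario can be checked with a minimal sketch. It assumes storage and compute sit in series and fail independently, so their availabilities multiply; the function name `combined_availability` is purely illustrative.

```python
def combined_availability(*layers: float) -> float:
    """Multiply per-layer availabilities for a serial stack.

    Assumption: layers fail independently and every layer must be up
    for the service to be up.
    """
    result = 1.0
    for availability in layers:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        result *= availability
    return result


storage_availability = 0.98    # figure from the scenario above
compute_availability = 0.995   # figure from the scenario above

combined = combined_availability(storage_availability, compute_availability)
print(f"Combined availability: {combined:.2%}")
```

Under these assumptions the combined figure is always lower than the weakest layer, which is why a storage tier that trails compute drags the whole service below its target.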

If the customer is operating 5,000 GPUs across 50 racks, the 2% of time that storage is unavailable represents 876,000 lost GPU-hours a year, or around $2.6M in idle compute costs, plus any contractual SLA credits that apply. The point is that SLAs are only as strong as the weakest layer in the stack, which in most AI environments is storage. Those numbers will make any CIO sit up and take notice.
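As a rough check on those figures, the sketch below reproduces them under two assumptions that the article does not state outright: that the lost hours come from the 2% storage downtime alone, and that a GPU-hour is valued at about $3, the rate the dollar figure implies. The names `lost_gpu_hours` and `ASSUMED_RATE_USD` are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours in a non-leap year
ASSUMED_RATE_USD = 3.00     # assumption: implied by $2.6M / 876,000 GPU-hours


def lost_gpu_hours(gpu_count: int, availability: float) -> float:
    """GPU-hours left idle per year when the data path is unavailable."""
    return gpu_count * HOURS_PER_YEAR * (1.0 - availability)


gpu_count = 5000
storage_uptime = 0.98       # 2% of the year without storage access

idle_hours = lost_gpu_hours(gpu_count, storage_uptime)
idle_cost = idle_hours * ASSUMED_RATE_USD

print(f"Idle GPU-hours per year: {idle_hours:,.0f}")
print(f"Idle compute cost: ${idle_cost:,.0f}")
```

The estimate excludes SLA credits, which depend on contract terms and would add to the total.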

Built for a Different Era

To an extent, this is a legacy technology issue. Many AI storage environments in use today were originally designed as scratch storage, meaning they were optimized for short-lived, high-speed workloads rather than for the sustained, SLA-backed production use that AI now demands.

Granted, these systems perform well under ideal conditions, but problems quickly become apparent when components fail, as they inevitably will at some point. This is a particular challenge for large-scale AI environments, where systems can easily include hundreds of nodes and failures are to be expected. And we’re not just talking about major components falling over; even routine issues, such as metadata bottlenecks or network timeouts, can interrupt data pipelines without a full system outage taking place.

In AI environments, these issues are amplified because systems require continuous, parallel, high-throughput access. Storage systems must, therefore, deliver that performance without interruption. The practical implication is that peak throughput is no longer the most meaningful metric; what matters more is how the system performs once components begin to fail or degrade.

Leaving the Legacy

So, how can CIOs address this growing challenge? Fundamentally, system design needs to focus on sustained performance when infrastructure is under stress. In particular, resilience needs to be embedded into the storage architecture itself rather than added as a secondary layer.

Distributed, shared-nothing designs remove reliance on individual components, allowing systems to continue operating even when nodes fail. In this context, “shared-nothing” means that no node depends on a central resource, allowing each part of the system to function independently rather than relying on a single point of coordination or failure.

This contrasts with traditional architectures that depend on local redundancy, which is limiting at scale. As a result, the focus moves from protecting individual components to maintaining overall system integrity and availability. Automated data integrity checks help identify and isolate issues before they impact AI pipelines, while regular recovery testing under realistic conditions ensures that restoration processes can be carried out quickly enough to meet the needs of production environments. The underlying point is that storage systems can and should be designed to deliver consistent availability and throughput in real-world conditions rather than just under ideal benchmarks.
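One simple form an automated integrity check can take is comparing each stored chunk against a checksum recorded at write time. The sketch below is illustrative only and does not describe any particular product; the chunk IDs, the in-memory dictionaries, and the helper names are hypothetical stand-ins for a real storage layer.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of a chunk as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_corrupt_chunks(chunks: dict[str, bytes],
                        manifest: dict[str, str]) -> list[str]:
    """List chunk IDs whose current digest differs from the recorded one."""
    return sorted(
        chunk_id
        for chunk_id, data in chunks.items()
        if checksum(data) != manifest.get(chunk_id)
    )


# Hypothetical chunks and the manifest recorded when they were written.
stored = {"chunk-0": b"training shard A", "chunk-1": b"training shard B"}
manifest = {chunk_id: checksum(data) for chunk_id, data in stored.items()}

# Simulate silent corruption of one chunk after it was written.
stored["chunk-1"] = b"training shard B (corrupted)"

corrupt = find_corrupt_chunks(stored, manifest)
print(corrupt)  # ['chunk-1']
```

A chunk flagged this way could be isolated and rebuilt from redundant copies before a training job reads it, rather than surfacing as a failure in the middle of a pipeline.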

Given the current direction of AI infrastructure investment, these challenges will only become more apparent (and costly) as additional infrastructure comes online. Many organizations are still selecting storage based on peak performance, and it’s clearly a habit that is hard to break. But if AI systems are to deliver the reliability users expect, it’s also a habit that has to change.

About the Author

Ken Claffey

Ken Claffey is CEO of VDURA, a modern data storage infrastructure software company purpose-built for AI and HPC workloads. With a track record of building and scaling businesses across the HPC and storage ecosystem, he has held senior executive roles at Seagate, Xyratex, Adaptec, and Eurologic.

At Seagate, Ken led the Enterprise Storage division through a period of transformative growth, driving full-stack product innovation to address emerging customer needs across Cloud and Enterprise markets. At Xyratex, he built the ClusterStor HPC storage business from the ground up — a platform that went on to power 40% of the world's top supercomputers before being acquired by Cray/HPE. He was also instrumental in the subsequent sale of Xyratex to Seagate, and held leadership roles spanning product, operations, sales, and engineering at Adaptec and Eurologic.

As CEO of VDURA, Ken brings the strategic insight and hands-on expertise to translate vision into results.
