Why CIOs need to reassess storage architecture for AI infrastructure

Ken Claffey, CEO of VDURA, explains why storage systems can and should be designed to deliver consistent availability and throughput in real-world conditions rather than just under ideal benchmarks.

The sheer scale of AI infrastructure is exposing limitations in architectures not designed for such enormous workloads. At the same time, demand for GPUs has surged, prompting cloud providers to shift from selling raw compute to delivering guaranteed outcomes as AI increasingly moves into production environments.

As a result, SLAs are now a commercial expectation rather than a point of differentiation. Providers are committing to high levels of rack-level uptime, raising expectations across the entire stack. Indeed, those that cannot meet agreed performance standards risk losing deals early in the buying process as organizations identify a growing gap between the promises being made and what the underlying infrastructure can consistently deliver.

A key part of this performance dynamic is the role played by storage. AI workloads don’t just depend on continuous, predictable compute capability; they also depend on continuous data delivery. GPUs can only process data if it is supplied without interruption, so any delay or gap in storage access immediately halts useful work. From an infrastructure specification perspective, this means that storage availability must not merely match compute availability but exceed it if overall system performance is to be maintained.

Consider this scenario: if an AI storage system runs at 98% availability and compute runs at 99.5%, the two figures multiply, so the combined service level drops to about 97.5% (0.98 × 0.995), below the performance customers have been promised. But that’s not the only problem. At scale, this issue can quickly translate into significant idle GPU capacity and the risk of associated SLA penalties.
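The combined figure comes from treating storage and compute as serially dependent layers, so their availabilities multiply. A minimal sketch, assuming the layers fail independently (the function name is illustrative and not taken from any library):

```python
def combined_availability(*layers: float) -> float:
    """Availability of a stack in which every layer must be up at once.

    Assumes layer failures are independent, so availabilities multiply.
    """
    result = 1.0
    for availability in layers:
        result *= availability
    return result


storage = 0.98   # storage availability from the scenario above
compute = 0.995  # compute availability from the scenario above

stack = combined_availability(storage, compute)
print(f"Combined availability: {stack:.1%}")  # Combined availability: 97.5%
```

Adding further layers to the call, such as networking, only lowers the result, which is why the weakest layer dominates.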

If the customer is operating 5,000 GPUs across 50 racks, the 2% of the year in which storage is unavailable represents 876,000 lost GPU-hours (5,000 GPUs × 8,760 hours × 0.02), or around $2.6M in idle compute costs annually at roughly $3 per GPU-hour, plus any contractual SLA credits that apply. Those numbers will make any CIO sit up and take notice. The point is that SLAs are only as strong as the weakest layer in the stack, which in most AI environments is storage.
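The GPU-hour figure can be reproduced with simple arithmetic. The sketch below assumes that the 2% storage unavailability from the earlier scenario idles the whole fleet, and it uses a rate of $3 per GPU-hour, which is an assumption inferred from the totals quoted here rather than a published price:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def idle_gpu_hours(gpus: int, storage_availability: float) -> float:
    """GPU-hours lost per year while storage is unavailable."""
    return gpus * HOURS_PER_YEAR * (1.0 - storage_availability)


lost_hours = idle_gpu_hours(gpus=5_000, storage_availability=0.98)

assumed_rate_usd = 3.00  # assumption: cost per GPU-hour, not stated in the article
idle_cost_usd = lost_hours * assumed_rate_usd

print(f"{lost_hours:,.0f} GPU-hours lost, about ${idle_cost_usd / 1e6:.1f}M per year")
# 876,000 GPU-hours lost, about $2.6M per year
```

Changing `assumed_rate_usd` to match an actual contract rate scales the cost linearly; SLA credits would come on top of this figure.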

Built for a Different Era

To an extent, this is a legacy technology issue. Many AI storage environments in use today were originally designed as scratch storage, meaning they were optimized for short-lived, high-speed workloads rather than for the sustained, SLA-backed production use cases now typical of AI.

Granted, these systems perform well under ideal conditions, but problems quickly become apparent when components fail, as they inevitably will. This is a particular challenge for large-scale AI environments, where systems can easily span hundreds of nodes and failures are to be expected. And we’re not just talking about major components falling over; even routine issues, such as metadata bottlenecks or network timeouts, can interrupt data pipelines without a full system outage taking place.
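To see why failures should be expected at this scale, consider the chance that at least one node is unavailable at a given moment. This is a rough sketch, assuming independent node failures and a hypothetical per-node availability of 99.9%; neither assumption comes from the article:

```python
def prob_any_node_down(nodes: int, node_availability: float) -> float:
    """Probability that at least one of `nodes` is unavailable.

    Assumes nodes fail independently of one another.
    """
    return 1.0 - node_availability ** nodes


ASSUMED_NODE_AVAILABILITY = 0.999  # hypothetical figure for illustration

results = {
    n: prob_any_node_down(n, ASSUMED_NODE_AVAILABILITY)
    for n in (10, 100, 500)
}
for n, p in results.items():
    print(f"{n:>4} nodes: {p:.1%} chance at least one is down")
```

Under these assumptions the probability grows from about 1% at 10 nodes to well over a third at 500 nodes, so a design that only performs when every node is healthy will spend much of its life degraded.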

In AI environments, these issues are amplified because systems require continuous, parallel, high-throughput access. Storage systems must, therefore, deliver that performance without interruption. The practical implication is that peak throughput is no longer the most meaningful metric; what matters more is how the system performs once components begin to fail or degrade.

Leaving the Legacy

So, how can CIOs address this growing challenge? Fundamentally, system design needs to focus on sustained performance when infrastructure is under stress. In particular, resilience needs to be embedded into the storage architecture itself rather than added as a secondary layer.

Distributed, shared-nothing designs remove reliance on individual components, allowing systems to continue operating even when nodes fail. In this context, “shared-nothing” means that no node depends on a central resource, so each part of the system functions independently rather than relying on a single point of coordination or failure.
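One way to picture placement without a central coordinator is rendezvous hashing, in which every client computes the same replica list from the key and the node names alone. This is an illustrative technique chosen for the sketch, not a description of any specific product; the node names, key, and function names are hypothetical:

```python
import hashlib


def score(node: str, key: str) -> int:
    """Deterministic score for a (node, key) pair; no shared state needed."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(key: str, nodes: list[str], copies: int = 2) -> list[str]:
    """Nodes holding `key`, ranked by score (rendezvous hashing)."""
    ranked = sorted(nodes, key=lambda node: score(node, key), reverse=True)
    return ranked[:copies]


def read_from(key: str, nodes: list[str], failed: set[str], copies: int = 2):
    """Return the first healthy replica, or None if all replicas are down."""
    for node in replicas_for(key, nodes, copies):
        if node not in failed:
            return node
    return None


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
primary, secondary = replicas_for("shard-0042", nodes)

# If the primary fails, the read falls through to the second replica
# without consulting any central directory.
print(read_from("shard-0042", nodes, failed={primary}) == secondary)  # True
```

Because placement is a pure function of the key and node list, losing any single node removes one replica but no coordination point.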

This contrasts with traditional architectures that depend on local redundancy, an approach that becomes limiting at scale. In a distributed design, the focus moves from protecting individual components to maintaining overall system integrity and availability. Automated data integrity checks help identify and isolate issues before they impact AI pipelines, while regular recovery testing under realistic conditions confirms that restoration processes can be carried out quickly enough to meet the needs of production environments. The underlying point is that storage systems can and should be designed to deliver consistent availability and throughput in real-world conditions rather than just under ideal benchmarks.
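An automated integrity check can be as simple as comparing each stored block against a checksum recorded at write time. The sketch below is an illustrative assumption rather than any vendor's implementation: the in-memory `blocks` dictionary and the function names are hypothetical stand-ins for a real storage layer.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest of a block, recorded when the block is written."""
    return hashlib.sha256(data).hexdigest()


def find_corrupt_blocks(blocks: dict[str, bytes], expected: dict[str, str]) -> list[str]:
    """Return IDs of blocks whose current digest differs from the recorded one."""
    return sorted(
        block_id
        for block_id, data in blocks.items()
        if checksum(data) != expected.get(block_id)
    )


# Hypothetical blocks and their checksums captured at write time.
blocks = {"block-0": b"training shard A", "block-1": b"training shard B"}
expected = {block_id: checksum(data) for block_id, data in blocks.items()}

blocks["block-1"] = b"training shard X"  # simulate silent corruption

corrupt = find_corrupt_blocks(blocks, expected)
print(corrupt)  # ['block-1']
```

In practice the flagged blocks would be isolated and rebuilt from healthy copies before a training job reads them.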

Given the current direction of AI infrastructure investment, these challenges will only become more apparent (and costly) as additional infrastructure comes online. Many organizations are still selecting storage based on peak performance, and it’s clearly a habit that is hard to break. But if AI systems are to deliver the reliability users expect, it’s also a habit that has to change.

About the Author

Ken Claffey

Ken Claffey is CEO of VDURA, a modern data storage infrastructure software company purpose-built for AI and HPC workloads. With a track record of building and scaling businesses across the HPC and storage ecosystem, he has held senior executive roles at Seagate, Xyratex, Adaptec, and Eurologic.

At Seagate, Ken led the Enterprise Storage division through a period of transformative growth, driving full-stack product innovation to address emerging customer needs across Cloud and Enterprise markets. At Xyratex, he built the ClusterStor HPC storage business from the ground up — a platform that went on to power 40% of the world's top supercomputers before being acquired by Cray/HPE. He was also instrumental in the subsequent sale of Xyratex to Seagate, and held leadership roles spanning product, operations, sales, and engineering at Adaptec and Eurologic.

As CEO of VDURA, Ken brings the strategic insight and hands-on expertise to translate vision into results.
