Rethinking AI Infrastructure with GPU Diversity and TCO

Written by Rob Stevenson, Field CTO

Embracing GPU Diversity to Unlock AI Performance 

As the AI landscape continues to evolve rapidly, businesses are under growing pressure to build infrastructure that can adapt. One of the most effective strategies is embracing GPU diversity. Different GPUs offer different strengths. Some are optimized for inference, others for training, and some excel at image processing with advanced ray tracing capabilities. By matching the right GPU to the right workload, organizations can improve performance, increase efficiency, and build AI systems that are ready for both current needs and future demands. 

Locking into a single GPU can limit your ability to accelerate your AI workloads. That’s why at Cerio, we think it makes a lot of sense to leverage multiple GPUs, maintain a heterogeneous GPU environment, and compose resources on demand for each workload.

The other part is that teams are improving their agent capabilities through accuracy of response, and certain GPUs give slightly better performance and accuracy. You want a strong testing harness that can run your full test suite against a new foundation model on a different GPU and then evaluate the accuracy. If the accuracy improves, that’s something you want to take advantage of, because it means the user experience will be more positive.
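To make that concrete, here is a minimal sketch of what such a harness could look like. The run_model callable, the GPU and model identifiers, and the test cases are all hypothetical placeholders rather than a real API; the point is simply to run the same test suite against every model-and-GPU combination and compare accuracy against your current baseline.

```python
# Minimal sketch of an accuracy-regression harness.
# run_model() is a hypothetical stand-in for whatever inference call your stack exposes.

from typing import Callable, Dict, List, Tuple

def evaluate(run_model: Callable[[str, str, str], str],
             model: str,
             gpu: str,
             test_cases: List[Tuple[str, str]]) -> float:
    """Run every (prompt, expected) pair on one model/GPU combination and return accuracy."""
    correct = sum(
        1 for prompt, expected in test_cases
        if run_model(model, gpu, prompt).strip() == expected.strip()
    )
    return correct / len(test_cases)

def compare_configs(run_model: Callable[[str, str, str], str],
                    configs: List[Dict[str, str]],
                    test_cases: List[Tuple[str, str]],
                    baseline_accuracy: float) -> List[dict]:
    """Score each model/GPU combination and flag the ones that beat the current baseline."""
    results = []
    for cfg in configs:
        accuracy = evaluate(run_model, cfg["model"], cfg["gpu"], test_cases)
        results.append({**cfg, "accuracy": accuracy, "beats_baseline": accuracy > baseline_accuracy})
    return sorted(results, key=lambda r: r["accuracy"], reverse=True)
```

In practice you would also log latency and cost per run, but the accuracy comparison above is the gate described here.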

So, increasing accuracy and protecting yourself from vendor lock-in are both important reasons to diversify GPU components in an AI factory. 

Measuring the Impact Through Cost and Efficiency 

You need to measure the impact and determine what the cost savings are. You can look at it in terms of token-processing capability and accuracy, and then factor in the actual physical costs.

This is where TCO assessments come into play. You can analyze different GPU architectures, look at power consumption, cooling requirements, and so on. That becomes another way to continuously improve your agentic AI capability. 
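As a rough sketch of that kind of calculation (all figures in the example calls below are invented placeholders, not benchmarks): amortize the hardware cost over its service life, add power and cooling, and divide by the tokens the GPU actually processes.

```python
# Illustrative TCO-per-token sketch; every number in the example calls is a placeholder.

HOURS_PER_YEAR = 8760

def cost_per_million_tokens(hardware_cost: float,
                            amortization_years: float,
                            power_kw: float,
                            cooling_overhead: float,    # e.g. 0.4 = 40% extra power for cooling
                            electricity_per_kwh: float,
                            utilization: float,         # fraction of hours doing useful work
                            tokens_per_second: float) -> float:
    """Blend amortized hardware cost with power/cooling OpEx, divided by token throughput."""
    hourly_capex = hardware_cost / (amortization_years * HOURS_PER_YEAR)
    hourly_opex = power_kw * (1 + cooling_overhead) * electricity_per_kwh
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (hourly_capex + hourly_opex) / tokens_per_hour * 1_000_000

# Two made-up GPU profiles: a high-power card versus a lower-power alternative.
print(cost_per_million_tokens(30_000, 4, 0.7, 0.4, 0.12, 0.6, 2_500))
print(cost_per_million_tokens(12_000, 4, 0.3, 0.4, 0.12, 0.6, 1_200))
```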

We want to show people how to apply a continuous, balanced scorecard approach to TCO in order to improve AI operations over time. 
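One way to picture that scorecard, purely as a sketch with made-up metrics and weights, is to normalize a handful of measures (cost per token, accuracy, utilization) to a common scale and roll them into a single score you recompute every cycle.

```python
# Hypothetical balanced-scorecard roll-up; metric names, values, and weights are illustrative.

def tco_scorecard(metrics: dict, weights: dict) -> float:
    """Weighted sum of metrics normalized to a 0-1 scale where 1 is best.

    Cost-type metrics should be inverted before being passed in, so that a
    higher score always means a healthier configuration.
    """
    return sum(weights[name] * metrics[name] for name in weights)

score = tco_scorecard(
    metrics={"cost_per_token": 0.80, "accuracy": 0.92, "utilization": 0.60},
    weights={"cost_per_token": 0.40, "accuracy": 0.40, "utilization": 0.20},
)
print(round(score, 3))  # recompute per GPU configuration, every cycle
```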

Staying Agile with a Stepping Stone Approach 

The industry is moving rapidly, and with this level of uncertainty and volatility in the market, there’s definitely risk involved. But there’s also a high reward if you’re first to adopt a new capability in your AI architecture.

At the same time, there’s also the risk of spending too much on underutilized assets. That’s why we recommend embracing a stepping stone approach. Test things early, figure out what works, measure it using your TCO scorecard, and then iterate. This helps you stay ahead of your competition instead of falling behind, and it allows your organization to learn faster.  

AI is changing how we think about business. Instead of viewing code as intellectual property, it’s really about the organization’s ability to learn and apply that learning inside the AI factory. That becomes the competitive advantage. 

Once you understand the impact of that learning, you adjust your hardware strategy. Incremental improvements, followed by testing, tracking accuracy, understanding cost, and then iterating again—that’s the key to staying competitive. 

Scaling Smarter with Through-Cycle Innovation 

What you’re trying to do is scale quickly once you find a solution that works. If you can outperform the competition by scaling faster, you can recover your investment sooner. 

Through-cycle innovation involves finding and deploying new assets that give you a competitive advantage. For example, you might choose lower-power GPUs that are still very effective for inference. Then you scale that quickly and track the accuracy across your agentic AI processes. The goal is to balance your architecture so that you’re not locked into any one vendor, and then stay adaptive.
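As a back-of-the-envelope way to compare those options (the figures below are invented), tokens per second per watt is one simple proxy for how effective a lower-power GPU is for inference.

```python
# Invented figures to illustrate a tokens-per-second-per-watt comparison.

candidates = {
    "high-power training GPU": {"tokens_per_s": 2500, "power_w": 700},
    "lower-power inference GPU": {"tokens_per_s": 1400, "power_w": 300},
}

for name, spec in candidates.items():
    efficiency = spec["tokens_per_s"] / spec["power_w"]
    print(f"{name}: {efficiency:.2f} tokens/s per watt")
```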

At Meta, for example, they use what they call a new product innovation process. They containerize a solution, push it out, and anticipate failures. Instead of spending eight months on a waterfall design process, they compress design into four to six weeks. They roll out a pod that may have some issues but still delivers good accuracy. Then they evaluate the cost, iterate, and spin up a better version. 

This approach lets you cycle through innovation quickly and gain an advantage. 

Building a Culture of Strategic Experimentation 

It really takes a shift in mindset. You can’t be overly risk averse. You have to think about strategic learning and how to balance your portfolio. 

For example, I typically keep about 75% of my activities in steady-state mode, using well-established solutions. The remaining 25% is for experimentation—those “mad money” areas where you test new ideas and push your organization’s capabilities. If one of those efforts proves to have a strong cost advantage or improves the accuracy of your agentic AI, you can move it into production quickly. 

You need to have a sandbox area for experimentation, alongside your standard production workloads. A lot of infrastructure professionals have been trained to avoid touching production and eliminate anything that might cause disruption. That’s still important for stability, but it should be complemented with spaces for innovation. 

Just like you run TCO assessments for your steady-state systems, you should also evaluate your experimental setups. Track the accuracy, review the costs, and if something proves valuable, move it into production. 
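A sketch of that promotion decision, with hypothetical thresholds, can be as simple as comparing an experiment’s accuracy and cost per token against the current production baseline.

```python
# Hypothetical promotion gate for moving an experiment into production.

def should_promote(experiment: dict, production: dict,
                   min_accuracy_gain: float = 0.01,
                   max_cost_ratio: float = 1.0) -> bool:
    """Promote only if accuracy improves and cost per token does not regress."""
    accuracy_gain = experiment["accuracy"] - production["accuracy"]
    cost_ratio = experiment["cost_per_token"] / production["cost_per_token"]
    return accuracy_gain >= min_accuracy_gain and cost_ratio <= max_cost_ratio

print(should_promote(
    experiment={"accuracy": 0.93, "cost_per_token": 0.00014},
    production={"accuracy": 0.91, "cost_per_token": 0.00016},
))  # True with these made-up numbers
```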

You don’t need to follow a long waterfall design process anymore. Think in agile terms. Try smaller GPU solution stacks. Explore lower power consumption or experiment with tools like Dynamo and key-value stores. If you’re building a super pod, have a base pod doing innovation work at the same time. That keeps your team learning and gives your organization a path to quickly bring cost-saving ideas into production. 

Real-Time TCO: The New Standard for AI Infrastructure 

The key is that we’re now doing TCO calculations in real time and on an ongoing basis, rather than treating them as long, one-time projects. That’s a shift in thinking. 

In the past, you’d spend months building a chart of accounts to figure out your TCO and then look for ways to reduce CapEx or OpEx. Now we’re shortening that process and applying it directly to AI factory operations. TCO becomes an essential tool for making your AI operations more efficient. You improve it through quick iterations and small, manageable risks. That’s how you build a lean, responsive AI organization. 
