Video: The Disaggregated Data Center


One of the key misconceptions about composable disaggregated infrastructure (CDI) is that there’s an inherent performance trade-off. While some argue that the performance of a tightly integrated system will always be better, the reality is a little more nuanced.

Watch the second episode in our CTO Leadership Series: The Disaggregated Data Center. Cerio CTO Matt Williams and Dr. Ryan Grant, Assistant Professor at Queen’s University, discuss how performance can actually improve in disaggregated systems with the right design.

Disaggregated systems make it possible to increase overall GPU utilization by sharing GPUs in ways that have never been possible with a closed system architecture. Because you don’t pay a lot for disaggregation, the returns from higher utilization arrive much sooner than they would with a 5-10% overhead, where you would need a 5-10% better result just to break even. You can increase GPU utilization and use systems better than they’re being used today. The end result is a lower cost per GPU or accelerator deployed, along with the ability to adapt systems faster and use them more to drive greater overall value.

Stay tuned for our next episode in the CTO Leadership Series: Designing AI Systems for Scale.

VIDEO: Cerio CTO Matthew Williams talks to Dr. Ryan Grant from Queen’s University about disaggregating the data center.

Video transcript

Hi, welcome to the second episode of our CTO Leadership series. Today we’ll be focusing on the disaggregated data center. I’m Matthew Williams. I’m the CTO of Rockport Networks, and with me again is Dr. Ryan Grant, one of the leading experts around the world for high-speed networking and scaling solutions. 

So thank you again for joining us, Ryan.  

Hi Matt. Thanks for having me. It’s great to be here.  

So there are some preconceptions about disaggregated systems, which I want to clear up. The first really is around performance: that there must be a performance impact, and that it’s negative, because my GPUs, my accelerators, are no longer co-located in my servers.

When we take a look at a high level at what a multi-GPU workload actually looks like, here on the left-hand side we have a disaggregated system with 16 GPUs located within the same chassis. And so we have a server that’s composing multiple GPUs into that server. From a PCIe perspective, it’s all within the same domain, and it’s very high-performance east-west traffic, you know, full Gen 5 performance within it.

When we contrast that with traditional GPU servers running multi-GPU workloads, here we have four different servers with four GPUs apiece, a total of 16 GPUs just like on the left. But in this case we have a secondary fabric, because these are all separate PCIe domains. There’s a different network that joins those servers together and provides the east-west connectivity, and that can actually create a new performance bottleneck when compared with putting them all in the same PCIe domain.

And I think it’s important to keep in mind here that those don’t necessarily need to be GPUs, right? So, one of those servers on the right-hand side could be tensor processing units or some other type of AI specific hardware, they could have some very specialized domain specific hardware in them. 

And when you’ve got them on a disaggregated system, it’s easier to access those things and schedule the use of them. So, from that perspective, you’re not necessarily trapping accelerators inside of a node reservation that somebody else is using, but fully utilizing all the resources in that node as well.

So, I think there are some advantages to having that broader resource pool, aside from just the ones that you outlined here, in terms of flexibility for scheduling and getting jobs through and done for users. What they really care about is time to completion of their job, and that includes time waiting to access the resources.

Absolutely. And on the left-hand side, if I have a job that needs 16 GPUs, I can take any one of the servers, I can give it any of those GPUs or other accelerators, and the job gets started immediately. On the right-hand side, I’m waiting for a specific node type, and that can mean I’m sitting in the queue for quite a long time. So I’m waiting for the results. I’m waiting longer to get the data.

Exactly. I normally spend more time with my jobs waiting in the queue than they actually spend executing anyways. So it’s nice to be able to get access to the resources that you need when you need them.  

Absolutely. So, let’s take a little bit more detailed look at GPU deployment options. 

So on the left-hand side, we have a disaggregated system: lots of different kinds of servers, standard servers, and pools of resources. On the bottom here, we’re showing device enclosures with 8 GPUs. So when I want to run a job, in this case, my optimum number of GPUs per server is eight. So I compose them as needed, and my job starts.

On the right-hand side, in the traditional infrastructure, I may only have access to four GPUs within a server. So it’s not an optimum ratio, and of course I’m waiting for a specific server before the job can kick off. When you take a look at the cost of these two solutions as well, you can actually see tremendous cost benefits in disaggregation, which means that, from an end-user perspective, across the whole pool of resources you have more GPUs and accelerators available than you would have with traditional GPU servers.

I think the key takeaway here too, Matt, is that I can better balance the servers in my system. So if I need a lot of CPU compute, I add a whole bunch of servers just like I would today, and I don’t have to put accelerators in them.

But if I need a lot of accelerators, I don’t necessarily have to buy the servers just because I need PCIe slots to put the accelerators in them. I might not even be using those heavily during a job at all anyways. So it’s useful to be able to mix and match the resources that I need rather than being trapped just because of the traditional server rack architecture into buying resources that I don’t necessarily need for my job or for users to get their results through faster. 

Yeah, agreed. So let’s take a closer look at the economic advantages. I just mentioned that you can get more GPUs. Here we’re showing a contrast of three different kinds of approaches. The one in purple on the left-hand side of the graph is a highly dense integrated solution. The one in the middle is really building your own systems with OEM servers and GPUs, to build your own infrastructure with four GPUs per server. And on the right-hand side we’re showing row-scale disaggregation. So you can see in the right-hand columns of that graph that the cost per GPU, counting the GPU itself and all the infrastructure you need to deploy it, is much, much lower when we look at row-scale disaggregation. It is lower to the point that here we’re modeling 32 GPUs for the highly dense solution on the left-hand side and 48 at row scale, and yet the total system cost is over 30% less than that of the highly dense integrated solution.
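As a rough check on the figures quoted in the paragraph above, the sketch below works out what a 30% lower total system cost with 48 GPUs instead of 32 implies for cost per GPU. The function name and the normalization of the dense system's cost to 1.0 are assumptions made for illustration. The transcript says "over 30%", so the result is an upper bound on the relative cost per GPU, not a figure taken from the graph.

```python
def relative_cost_per_gpu(gpus_dense, gpus_row_scale, total_cost_reduction):
    """Cost per GPU of the row-scale system relative to the dense system.

    Assumes the dense system's total cost is normalized to 1.0 and the
    row-scale system's total cost is (1 - total_cost_reduction).
    """
    dense_cost_per_gpu = 1.0 / gpus_dense
    row_scale_cost_per_gpu = (1.0 - total_cost_reduction) / gpus_row_scale
    return row_scale_cost_per_gpu / dense_cost_per_gpu


# Figures quoted in the transcript: 32 GPUs vs. 48 GPUs, total cost 30% lower.
ratio = relative_cost_per_gpu(32, 48, 0.30)
print(f"Row-scale cost per GPU is about {ratio:.0%} of the dense solution's")
```

Under these assumptions, each GPU deployed at row scale costs a little under half of what it costs in the highly dense integrated system.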

So not only are you getting better utilization of resources and faster job startup time, but the overall cost of infrastructure is lower. There are tremendous economic advantages, much higher performance per PCIe.

Even if you neglect the economic advantage, the highly dense integrated solution is the way it is simply because you want a lot of GPUs very close physically to each other and all attached to the same bus.

So you’re essentially getting a similar situation at row scale, because you’re concentrating the GPUs physically close to each other there as well. You’re getting physical closeness, which helps with the speed-of-light limits on communication, and you’re getting the case where your GPUs can communicate amongst themselves.

And there’s a lot of communication that happens amongst GPUs in large AI and ML jobs, as well as in scientific computations, the main users of all those GPU cycles. So there are some definite advantages there in terms of GPUs being able to communicate effectively with each other because they’re so close together, rather than being tied into an individual server rack box.

For sure. So, you know, the utilization of GPUs and accelerators is a very strong indicator of the performance and value seen by end users. How much of the time those GPUs are actually doing work is a very important metric. Ryan, can you tell us a little bit more about the data in the graphs?

Sure. So this is some of the recent state-of-the-art data from the scientific community, looking at different things: on the left-hand side, GPU utilization of different AI/ML models, and on the right-hand side, a very detailed trace of the utilization throughout a job’s runtime for a quantum chemistry application. So, the key takeaways here are that GPU utilization can vary quite a bit in your AI and ML job, from very, very high utilizations to relatively low utilizations.

You can see here that most of these codes on the left-hand side are image-processing-related codes, and you can see differences between 90% utilization and about 20% utilization. That doesn’t mean that the ones that are at 90% utilization are doing a better job necessarily. Some of them are using these GPUs in a more efficient way.

So overall, though, the GPU utilization tells you that you’re not necessarily using your GPU for the entire length of that job. Now, the graph on the right sort of illustrates the two different ways that you can fall short of perfect GPU utilization, right? So the first section of that graph on the right, from about zero to, say, 50 seconds, is the time that you’re just setting up your job. You’re getting your data ready, you’re loading, you’re doing a whole bunch of compute on the CPU side. You’re not even launching anything on the GPU, or you’re not using it at all. That’s an opportunity to reclaim some of that GPU and let somebody else use that GPU during that time.

So you have it reserved for you, but you’re not actually using it. That’s a key element right there. And then you can see that there’s a sort of high frequency of GPU jobs running. One’s completing and not using as much of the overall compute resources. And then it is ramping back up. That’s that sort of high frequency center of the graph. 

And then you can see that at the end of the job you’re also having a low utilization period, as you wrap up and pull that data off of the card itself. So, with the disaggregated system, there are some opportunities to potentially loan those GPUs to other jobs. That’s a lot less complicated to do than when the GPUs are attached to the compute node that’s currently dealing with all the data and currently has the application running. It’s a lot easier to tease these things out. So there are some interesting research opportunities here, and interesting deployment opportunities on different workloads, where we can potentially reclaim some of those GPU cycles for doing other useful tasks.
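To make the idea of reclaimable GPU time concrete, here is a minimal sketch that summarizes a utilization trace by phase. The phase names, durations, and utilization values are hypothetical and are not read from the quantum chemistry trace discussed in the video. Only the roughly 50-second setup period is mentioned in the conversation.

```python
# Hypothetical trace: (phase, duration in seconds, average GPU utilization 0..1).
# Illustrative values only; they are NOT taken from the graph in the video.
TRACE = [
    ("setup", 50, 0.0),     # data loading and CPU-side work, GPU idle
    ("compute", 200, 0.7),  # bursts of GPU kernels with short gaps between them
    ("wrap-up", 30, 0.1),   # copying results off the card
]


def summarize(trace):
    """Return total reserved time, busy time, average utilization, and idle GPU-seconds."""
    total_s = sum(duration for _, duration, _ in trace)
    busy_s = sum(duration * util for _, duration, util in trace)
    return {
        "total_s": total_s,
        "busy_s": busy_s,
        "avg_util": busy_s / total_s,
        "idle_gpu_s": total_s - busy_s,
    }


summary = summarize(TRACE)
print(f"average utilization: {summary['avg_util']:.0%}")
print(f"idle GPU-seconds while reserved: {summary['idle_gpu_s']:.0f}")
```

In this made-up trace, roughly half of the reserved GPU time is idle. Those idle GPU-seconds are the cycles that could, in principle, be loaned to other jobs.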

I guess the idea here is that, instead of holding a GPU reserved because I’m going to need it at some point, I bind the GPU to the server and use it only when the job requires it. That sort of late binding of resources to servers, which allows the GPU to be used for other jobs in the meantime, makes for much higher utilization, which of course means you’re getting better results, faster results overall.

Exactly. And it’s key to point out here too that our GPUs spend a lot of time in these jobs actually doing communication amongst themselves as well. So having them very close physically to each other can help accelerate that communication by not having to have them communicate over long distances, which further increases your overall utilization in terms of doing the actual job that you want the GPU to do: that key compute for you, spending a little bit less time communicating just because they’re so dense and physically located next to each other.

Absolutely. So we’ve shown the flexibility and benefits of disaggregation. We’ve shown the utilization benefits, we’ve shown the cost benefits, but there are still going to be, like I said, preconceptions: yeah, but my job may run slower. So a really interesting paper was presented by Taylor Groves last year at SC22, analyzing different sizes of different kinds of jobs and different codes, and how much they are impacted by disaggregation. And I have to say that the results here really do surprise me. I was expecting a pretty significant impact from CDI on a per-GPU basis, and that really the advantages of CDI would overwhelm that and give you an overall better value proposition. The results show a very different story, where the impact is actually much lower than people might expect.

I was surprised by this, Matt, quite surprised. To see a big disaggregation penalty, you really must run very, very small problems.

So, you’ll see down at the bottom of the graph the matrix multiplies, which are used widely in all of the codes that we’ve discussed today. At those sizes they’re so small that they sort of sit on the line where you might as well do them on the CPU. The MILC runs that you’ll see in the disaggregated portion there, in the 1% to 5% range, are very, very small runs. Domain science is not my forte, but I know that there are very, very small messages for that particular domain. And the LAMMPS SNAP point on this graph, which is well under 1%, is a molecular dynamics application that’s running with a more commonly sized problem.

So, I think the key takeaway here is that the more realistic your problem sizes are, solving the real-world things that matter and that users actually want, the more the impacts seem to shrink to very, very small overall impacts on these jobs. I was quite surprised at that, thinking that most of the disaggregation penalties would be much further down to the lower left of that graph, and it’s just not the case with this.

We’ve been analyzing similar results internally, and you know what it seems to be is that these applications are not just transferring data and computing. What’s actually happening is transfers and computes are going on at the same time. So if a transfer of the next batch of data occurs while a current batch of data is being computed, as long as that data arrives in time for that next sort of compute cycle, there’s really no impact to disaggregation. So, we’re finding similar things from our work as well.  
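The overlap argument above can be sketched with a simple two-stage pipeline model. This is a minimal sketch, assuming fixed per-batch transfer and compute times and perfect double-buffering; the function names and the millisecond figures are hypothetical, not measurements from the video or from the SC22 paper.

```python
def serial_time(batches, transfer_ms, compute_ms):
    """Total time when each batch is transferred and then computed, with no overlap."""
    return batches * (transfer_ms + compute_ms)


def pipelined_time(batches, transfer_ms, compute_ms):
    """Total time when the transfer of batch i+1 overlaps the compute of batch i.

    The first transfer and the last compute cannot be hidden; every step in
    between costs whichever of the two stages is slower.
    """
    return transfer_ms + (batches - 1) * max(transfer_ms, compute_ms) + compute_ms


# Hypothetical numbers: 100 batches, 10 ms of compute per batch,
# 2 ms per transfer locally vs. 3 ms per transfer over a disaggregated fabric.
BATCHES, COMPUTE_MS = 100, 10
LOCAL_MS, REMOTE_MS = 2, 3

serial_penalty = serial_time(BATCHES, REMOTE_MS, COMPUTE_MS) / serial_time(BATCHES, LOCAL_MS, COMPUTE_MS) - 1
pipelined_penalty = pipelined_time(BATCHES, REMOTE_MS, COMPUTE_MS) / pipelined_time(BATCHES, LOCAL_MS, COMPUTE_MS) - 1

print(f"penalty without overlap: {serial_penalty:.1%}")
print(f"penalty with overlap:    {pipelined_penalty:.1%}")
```

As long as the longer transfer still finishes within one compute step, the extra transfer time is paid only once, at the start of the job, which is consistent with the very small impacts described above. If the transfer becomes slower than the compute step, the model shows the transfer time dominating instead.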

Thank you. So, yeah, some key takeaways from what we’ve talked about. The first is that disaggregation has this preconception in the market that performance is going to be lower, but in fact, not only do you get access to different kinds of accelerators and GPUs, but you get access to more of them: the right number of GPUs, the right number of accelerators, for the server to be optimized for your job.

So it’s really a different kind of story here. Disaggregation gives you cost advantages, it gives you flexibility, it gives you the right kind of compute for the right kind of job, and you get better performance.

And I think there are some opportunities here as well, Matt, to increase our overall GPU utilization just by sharing GPUs in different ways than they’ve been shareable in the past, because they’re decoupled from those individual nodes. And because the GPUs are so close to each other, you can very easily communicate and expand a job. But the fact here is that you also don’t really need to make up much for it, right? So with the GPU utilization potentially going up a little bit, and without paying a lot for that disaggregation, the returns come much more easily than they otherwise would. If you had a five or 10% overhead, you’d have to get a five or 10% better result before you broke even. In this case, we see that that breakeven point is very, very low. So we have opportunities to increase our overall GPU utilization and just use our systems better than we’re using them today.

Yeah, really it’s that you’ve got a lower cost per GPU deployed or per accelerator deployed, the ability to adopt these faster, and the ability to deploy even more within the pool, and that really does drive greater value. There really is no downside to disaggregation for the vast majority of use cases.
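The break-even reasoning above can be written down as a one-line model. This is a minimal sketch, assuming that a disaggregation overhead of a given fraction must be offset by an equal relative gain in utilization; the function name and the 50% baseline utilization are hypothetical and chosen only for illustration.

```python
def break_even_utilization(baseline_util, overhead):
    """Utilization needed on the disaggregated system to match the baseline.

    Assumes useful work per reserved GPU-hour scales as utilization / (1 + overhead),
    so breaking even requires a relative utilization gain equal to the overhead.
    """
    return baseline_util * (1.0 + overhead)


BASELINE = 0.50  # hypothetical baseline GPU utilization
for overhead in (0.10, 0.05, 0.01):
    needed = break_even_utilization(BASELINE, overhead)
    print(f"overhead {overhead:.0%}: need utilization above {needed:.1%}")
```

With an overhead of around 1% or less, as reported for realistic problem sizes, almost any improvement in sharing moves the system past break-even.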

So thank you Ryan, really appreciate your time today. Looking forward to our next episode where we’ll be focusing on designing AI systems for scale. 

Thank you very much. Thanks for having me. 
