Old is New Again: Bandwidth will be the AI bottleneck

Lori MacVittie: [00:00:00] Hi everyone. Welcome. It's Pop Goes the Stack, the podcast about emerging tech that insists it's disrupting something. I'm Lori MacVittie, and I'm here to make sure it's not your uptime. Now, this week everyone's scrambling for GPUs like it's a Black Friday doorbuster, while the real meltdown is actually brewing in your network.

Cooling, yes. Power, of course. But nobody's checking on the swarm of chatter from all these conversations that are turning your network into a stress test with a body count. So today, Joel—it's always DNS—Moses is here,

Joel Moses: You betcha.

Lori MacVittie: as usual. And we brought contrarian Ken Arora to argue with us and say, “no, it's not the network.” I think that's your position. I'm not sure we'll [00:01:00] find out. What we know is that 43% of new data centers are being targeted just for AI. So we're building data center space and networks just for AI. There's really no argument against it creates more traffic and different traffic patterns, but let's dig into why. Why it's important and maybe what we can do about it. Go ahead, Joel. Joel,

Joel Moses: Sure. Absolutely.

Lori MacVittie: you start, because it's not DNS this time.

Joel Moses: It's not DNS, although DNS is hugely important to AI as well, but we'll skip that. We'll save that as a topic for another time.

You know, AI networking is definitely important, but of course it depends on exactly what you intend to do with the AI workload. Obviously networking is important if you are training models, because you have to ingress, analyze, and consolidate a large amount of information into the models that you're building. [00:02:00]

As such, infrastructure that was built only a few years ago, that was mostly based around 10 gig networking, really has to be brought up to at least hundred-gig standards. And increasingly, you know, the volume of data that some of these training sets need requires that pretty much everything be 400 gig across the board.

And that's going from an old model of constructing a data center around a user load, or a server load that may not exhibit a lot of data transfer characteristics. When you're analyzing data sets and object-store databases, and you're trying to consolidate them into vector databases, networking is fairly critical; otherwise your training costs and training times balloon.

Ken Arora: So, I think a lot of that also depends on that line between what you call networking and what you call I/O. Where do you draw it? Certainly training puts a lot of emphasis [00:03:00] on inter-GPU communication inside your cluster, for sure. Huge amounts of traffic, right? Whether you use InfiniBand or whether you use, you know, TCP, whatever.

It's a huge amount of bandwidth. So as the architect here, you know, there are layers to this, right? This is a NUMA-type model. I'm an old guy, so I think in computer architecture terms, right? At the closest layer you're gonna have a lot of communication. One layer out, maybe not so much. And then one layer out from there, which is what I'll call the north-south traffic, you know, a lot of training's pretty self-contained.

And so, it really depends on what layer you're looking at. And as you said, training is one thing. The other part of the world is inferencing.

Joel Moses: Correct.

Ken Arora: Inferencing has its own set of requirements. And you know, it's everything from completely standalone, which is very, you know, non-intensive on the network, and much more interested in your latency than your bandwidth.

Joel Moses: Right.

Ken Arora: And then all the way to, if you're doing [00:04:00] something that's more RAG-oriented, yeah, you've got a little more bandwidth. But it's still, I guess I'll end with this: that ratio is really important. That ratio of GPU cycles to network bandwidth. That's what I look at. And genAI is very GPU-heavy, so you're all of a sudden using a lot more compute.

Are you necessarily using correspondingly more network? Not necessarily. Not in many cases.
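A rough way to picture the compute-to-bandwidth ratio Ken is describing is a back-of-the-envelope calculation for a single inference request. The sketch below is illustrative only; every number in it (model size, token count, payload size) is a hypothetical assumption, not a figure from the episode.

```python
# Hypothetical back-of-the-envelope: how much GPU work rides on each byte
# that crosses the network for one generative-AI inference request.
# All figures are assumptions chosen only to illustrate the ratio.

params = 70e9                  # assumed model size: 70B parameters
flops_per_token = 2 * params   # rough rule of thumb: ~2 FLOPs per parameter per generated token
tokens_out = 500               # assumed tokens generated for one response
bytes_on_wire = 20_000         # assumed request + response payload, ~20 KB

gpu_flops = flops_per_token * tokens_out
flops_per_byte = gpu_flops / bytes_on_wire

print(f"GPU work for one request : {gpu_flops:.2e} FLOPs")
print(f"Bytes on the network     : {bytes_on_wire:,}")
print(f"FLOPs per network byte   : {flops_per_byte:.2e}")
# The higher this ratio, the more the GPU, not the network, is the
# scarce resource, which is the point being made about genAI inferencing.
```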

Lori MacVittie: Okay. So you've got, I mean, two different scenarios here. One obviously, right, training where you are moving a lot of data across that boundary into this new AI factory, as we've been calling them, which is really that data center built out for it, because you need the GPUs to do the training.

When you're doing inferencing, you don't need the GPUs as much. But my question then becomes, you know, as we start deploying agents, are they gonna be in the old data center or are they going to be deployed in this AI factory? Because that changes the impact on the network because now you've got [00:05:00] all these weird traffic patterns going on and they use inferencing at the same time.

So you need to consider, right, how is that traffic going to impact the network inside your AI factory, if that's where you put them. Or will people put them in the old data center? Thoughts?

Ken Arora: Hmm, I don't know. I mean, inferencing, I think, over time is gonna move increasingly closer and closer to the edge, for two reasons. One, the economics, and the acceleration curve that GPUs are on: you don't need the latest and greatest, you know, fancy GPU accelerator.

Inferencing is a problem that you're gonna see increasingly go into more and more mainstream processors. Secondly, for many inferencing use cases, and also for agentic, latency is what matters more than bandwidth. You wanna get that answer quickly.

You've got real-time apps, and agentic just makes it worse because now you're not just making one inference request, you're making dozens of them to come up with a result. And so, [00:06:00] latency. And again, I'm gonna argue that if I look at it as an architect, the latency of the network may not be the bottleneck.

The latency of the actual inference engine may become the bottleneck. But in either case, yes, it's latency more than bandwidth.

Lori MacVittie: So, it’s design.

Joel Moses: Well, it is design. Now, I kind of wanna take issue a little bit with what he said, because, well, I have to, right? That's my role.

Lori MacVittie: Yeah.

Joel Moses: The, you know, latency is definitely key, but the latency of introducing data into the system actually has a magnifying effect, right? So the slower it is to get the data that you need to process, the slower the overall processing time is, by a factor of n, right? Just getting the data to the point where it can be inferenced is one piece of the puzzle.

And then, you know, you may rely on slower GPUs at some point, but just getting the data in the right place for the inferencing to be performed is still [00:07:00] a cost that has to be paid. So again, networking is still hugely important there.
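To make that "factor of n" concrete, here is a small back-of-the-envelope sketch. Every number in it is a hypothetical assumption, chosen only to illustrate the effect Joel describes, not taken from any real deployment.

```python
# Hypothetical illustration of the "magnifying effect" of data-ingress latency:
# the fetch delay is paid on every sequential step of a multi-step AI task,
# so it scales with the number of steps (the "factor of n").

fetch_latency_ms = 25    # assumed time to move the needed data to the inference point
compute_ms = 300         # assumed inference time once the data is local
steps = 12               # assumed sequential, data-dependent steps in the overall task

without_fetch = steps * compute_ms
with_fetch = steps * (compute_ms + fetch_latency_ms)
overhead_pct = 100 * (with_fetch - without_fetch) / without_fetch

print(f"Total time, data already local : {without_fetch} ms")
print(f"Total time, paying the fetch   : {with_fetch} ms")
print(f"Cost of data movement          : {overhead_pct:.1f}%")
```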

Ken Arora: Yep.

Joel Moses: Latency as a magnifying effect, I think, is one of the things. And you know, we're kind of skirting around this issue, but AI workloads tend to have what we hear called data gravity, meaning the data wants to be local, or wants to be pulled near the point of inferencing or the point of training. The efficiency of a system rises the closer it is to the data stores that make up the inferencing decision or are incorporated into the training of an individual model.

Right? So, there are three things that I see. One is bandwidth requirements, 'cause you're dealing with a massive amount of data. The second is data transfer considerations, one of which is the latency involved in that. And the third is the need for high-performance or higher-order networking. And all of these things, to me at least, [00:08:00] suggest that there is going to be a rapid expansion in the amount of data center infrastructure build-outs necessary just in the networking components, not even talking about the compute necessary for AI.

Lori MacVittie: Ah, that's important right there, right? As you were talking, that was my thought. It's like, yes, bandwidth, but we're also talking about, right, choke points in the old infrastructure, right? You know, it's like putting in a hub. “Hey, this'll work!” Ah, yeah, no. Right?

People may need to actually upgrade their entire, right, network infrastructure to make sure that it can handle that and process it. Because if it gets overloaded with too many requests or too much data either way, right, the processing time on the actual network equipment is going to impact how fast it can, you know, spit it out and take it in.

Ken Arora: Yeah, so I'd agree with you and I think with training, I'm kind of there. Data gravity is certainly a real thing, right? [00:09:00]

Joel Moses: Yeah.

Ken Arora: And if you've got massive amounts of data, you know, like you look at the training sets for these foundational models, they're immense. They're, you know, terabytes, petabytes of data. They're immense.

And yes, data gravity matters, and you've gotta put that data as close to the GPU clusters as you can. Inferencing? A little bit less so. And for many of the use cases of inferencing, that data can be replicated. It's, A, not as big in size, and, B, much of the time, not always, but much of the time, something that you can replicate and cache in location.

So I think I can see this going down the path that we saw CDNs go down, if I wanna rhyme. Things will get closer and closer to the edge. Some data we push down for RAG. Latency is an accelerant for sure, I mean, an accelerating factor. But one way to reduce latency is to not have a few monolithic data centers, but to have more PoPs where you can do the inferencing.

And that's the path I see inferencing going down. It [00:10:00] sorta rhymes with CDNs. In fact, there's a paper, I think you can get it on F5, that Mark Manger authored, I helped him with it, on balancing, you know, data gravity with latency

Joel Moses: Yeah.

Ken Arora: with power costs, which is another thing that isn't the topic today. But also trying to, you know, when you're deciding where to put your data center, power cost is a big factor. So yeah, data gravity's a real thing, but inferencing latency trumps data gravity would be my stance.

Joel Moses: Yeah. I don't disagree that a lot of inferencing workloads are gonna move rapidly towards the edge. We're already seeing that begin to happen. However, there are two other factors, and you mentioned one of them, and that's augmentation.

It's retrieval-augmented generation, where the data sets that you incorporate are brought in effectively at the point of inferencing and included as part of the AI task to perform. The other piece of this is what happens when edge-based AI becomes part of an agentic AI landscape, where the system itself is [00:11:00] initiating outbound transactions and pulling data back over to itself or remote-controlling other elements.

Those systems are gonna have a need for better and more consistent networking. And a generative AI system at the edge that has an agentic component is not gonna be trustworthy if it is only partially capable of reaching the resources that it needs to operate. So from that standpoint, we still need to look at the network transactions, even from the edge.

The other piece of networking at the edge that I think is really important for AI workloads is making sure that you can securely connect edge AI workloads back to a central hub where they obtain, for example, the models that are deployed to them. And that's a whole other topic of course, but it does definitely involve how to secure the internetworking between a central AI hub and lots of different edge inferencing workloads.

Ken Arora: Yeah. Yeah, I definitely agree. [00:12:00] Security and availability and reliability of those connections matter. So, reliability, networking matters. I was being more narrowly focused on the bandwidth concern. And a lot of those don't need a ton of bandwidth, right? I'm gonna update my data store.

Maybe it's a database, maybe it's a SharePoint document store. Maybe I'm gonna update that, but I don't need a huge amount of bandwidth. I do need it to be reliable. I do need it to be secure, for sure. But it isn't that. And if I can reduce latency—again, I go back to latency—that's really gonna help 'cause I'm gonna do that a bunch of times.

The flip side, the contrarian in me will say, though: if I'm looking at the latency, and let's say I'm doing some agentic workflow, yeah, you know, it will help me. It'll save me, I don't know, dozens of milliseconds to not have to go to some big data center someplace.

On the other hand, how much does that matter if the agentic inferencing is itself gonna take 300 milliseconds?

Joel Moses: Well, it's additive, I would suppose. So, [00:13:00] I mean, come on, an AI workload is gonna consist of multiple transactions. And if you're adding 12 milliseconds to each 300-millisecond round trip, that's still a substantial and measurable percentage of the performance that you're sacrificing

Ken Arora: Yeah.

Joel Moses: by not making sure that that latency is resolved.

Ken Arora: Yeah. I mean, it is 5 or 10%. I'm not gonna throw it away; you know, if I can get a 5 or 10% better deal on my car, I won't throw it away. But I'm not gonna fundamentally make buying decisions on that.
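Using the figures from this exchange as hypothetical inputs, the trade-off works out roughly like this; the call count is an assumption standing in for "dozens," and none of the numbers are measurements.

```python
# The edge-versus-central trade-off sketched above, with the numbers
# from the conversation treated as hypothetical inputs.

latency_saved_per_call_ms = 12   # network round trip saved by inferencing nearer the edge
inference_ms = 300               # time the inference engine itself takes per call
calls_per_workflow = 24          # "dozens" of inference requests in one agentic task (assumed)

saved_ms = latency_saved_per_call_ms * calls_per_workflow
central_total_ms = (inference_ms + latency_saved_per_call_ms) * calls_per_workflow
improvement_pct = 100 * saved_ms / central_total_ms

print(f"Time saved per agentic workflow : {saved_ms} ms")
print(f"Relative improvement            : {improvement_pct:.1f}%")
# A few percent: real and additive, as Joel says, but in the single-digit
# range that Ken weighs against other ways to spend the budget.
```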

Joel Moses: Yeah. Now, we've been talking about all of these things, kind of glossing over the fact that there's not just north-south agentic AI and RAG transactions, or training sets that come from the north-south corridor into a training complex. When you're doing training workloads, there's also intercluster traffic as well.

These systems, of course, spread load across each other. They try to create parameter sets by leveraging, you know, widely distributed GPU complexes, and the effect of latency there [00:14:00] is deadly. Isn't that correct?

Ken Arora: Yeah. No, I mean, when you build a distributed system, latency there does become a huge thing.

And you know what, I step back a level and it's really about: when do I decide to do that? What sort of workflows do I do? I mean, you know, again, I think of training as in many ways the complement, the flip side of inferencing. You've got a huge amount of bandwidth internally, not a lot of north-south bandwidth, and the latency of that bandwidth externally doesn't really matter so much.

So that in some ways is ideal, if you can partition the world into small enough chunks. I'll throw something in between, if you wanna comment on that, which is distillation.

Joel Moses: Mm-hmm.

Ken Arora: Distillation is sort of that middle ground between training and inferencing. And I actually think distillation, building smaller specialized models off of large foundation models, is gonna also become quite popular as we start becoming more cost-sensitive about the cost of inferencing.

Lori MacVittie: Yeah.

Joel Moses: That harks back to another topic we've discussed on this program, called Low [00:15:00] Rank Adaptation, but you can go listen to that one if you like. So, let's talk about the east-west, or the intercluster, networking. You know, there are of course AI networking standards that are built around doing RDMA transactions over things that are not traditionally found in most enterprise data centers.

And I'm talking, of course, of things like Falcon or InfiniBand. What do you think about that? Is it necessary to change out your networking simply to solve the magnifying effect of latency in these east-west transactions?

Ken Arora: I think if I'm really interested in training, yes. 'Cause training is, I mean, another ratio. Ratios matter in all these tradeoffs: the ratio of the cost of those GPU cycles versus the cost of networking. Those GPUs, you know, they're this hot, what did Lori call them? The hot thing on sale for, uh.

Lori MacVittie: Yeah, the [00:16:00] Black Friday Doorbuster.

Ken Arora: Black Friday, yeah. Right. I mean, these are pricey things. And so yes, I think that if you're doing training and you're spending a billion dollars a pop or more on building a data center someplace, yeah, definitely invest in the things that are gonna help you.

I mean, even if you get a 5% or, you know, 10% would be great, improvement in the efficiency of how you use those GPUs, and they're not idling that extra 5% or 10% of the time, that's, you know, if you've invested a billion dollars in GPUs, that's a hundred million dollars.

Lori MacVittie: It's a lot.

Joel Moses: Yeah, it is.
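The arithmetic behind that exchange is simple enough to spell out; the dollar figure is the hypothetical one used in the conversation, not a real budget.

```python
# Why small efficiency gains matter at GPU-cluster prices: a few percent
# less idle time on a hypothetical $1B GPU investment is real money.

gpu_investment_usd = 1_000_000_000   # hypothetical spend on GPUs, per the discussion

for gain in (0.05, 0.10):
    recovered = gpu_investment_usd * gain
    print(f"{gain:.0%} better utilization is roughly ${recovered:,.0f} of GPU value put to work")
```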

Lori MacVittie: But you know, to Joel's east-west comment, right? There's, you know, I think we're ignoring two things. One, right, agents might talk east, west, north, south, southwest, east-west, right, we don't know. Right? We anticipate that this will change traffic patterns.

That implies there will have to be all sorts of security and access [00:17:00] control mechanisms, right, in the infrastructure that also have to act, which is going to increase that latency and, right, the traffic going back and forth. Especially if they're querying, like, you know, identity stores or policy engines, something like that. So, you've got not just the traffic that's coming from AI and data, but you've also got all of this other management traffic that is collapsing into the data plane.

So you've got control, you've got all of this in the same plane. That's gonna be detrimental to the network if you don't segment and really pay attention to how you're routing.

Joel Moses: Yeah, yeah.

Now, one thing I'd like to point out here is InfiniBand is one of the things that people are doing to assuage the effects of latency. Now, I think that it's probably not such a great idea to choose a completely different networking architecture and standard that is not monitorable or operationalized in the same way that most enterprise [00:18:00] data centers are ready to monitor and maintain it.

One whose physical plant requirements are drastically different from the networking standard that exists north-south, in pursuit of what is, in effect, a couple of microseconds of latency advantage over a well-designed, say, Ethernet network.

We also, of course, have emerging standards: the Ultra Ethernet 1.0 spec was just released a couple of weeks ago, which is aimed directly at extremely low-latency enhancements to traditional Ethernet networking. And so I think rushing ahead to pursue a couple of microseconds of latency by choosing a completely different networking standard just injects a lot of cost into data center infrastructures that, in the final analysis, I don't think needs to be there in the near future.

Ken Arora: Yeah.

Lori MacVittie: I was gonna say, yeah, I think it depends on the use case. I mean, if you're talking about, you [00:19:00] know, I just have my companion assistant app. Sure, latency, you know, two milliseconds is not a big deal. If I am competitive gaming, yeah, yeah it is. That's the difference between winning and losing right there, that two milliseconds, right? You're done. So, depends on the use case.

Ken Arora: Yeah. Yeah. And in the end it's ROI. Am I better off spending, you know, you're right, Joel, a fair chunk of change on shaving those few microseconds off? Or are those dollars better spent on something else? In the end, if I take the point of view that the GPUs are the precious resource, then I have to do whatever I can to keep them as busy as I can.

It could be latency. It could be bandwidth. It could be something else; you know, for F5 it could just be a better load balancing scheme that says, “how do I distribute this workload among the GPUs to squeeze out 5% efficiency?” And that might be a better ROI than, you know, having to spend a bunch of money on InfiniBand. So yeah, I think you're right.
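As one illustration of what "a better load balancing scheme" for GPUs might look like, here is a minimal least-loaded assignment sketch. It is not F5's scheme or any product's behavior; the function name and the per-request "costs" are invented for the example.

```python
# Minimal sketch: greedily send each inference request to the GPU worker
# with the least outstanding work, so expensive accelerators idle less.
# Purely illustrative; request costs (e.g., expected token counts) are made up.

import heapq

def assign_least_loaded(request_costs, num_gpus):
    """Return a list of (gpu_id, total_assigned_work) after greedy assignment."""
    heap = [(0.0, gpu_id) for gpu_id in range(num_gpus)]  # (current load, gpu)
    heapq.heapify(heap)
    for cost in request_costs:
        load, gpu_id = heapq.heappop(heap)      # least-loaded GPU right now
        heapq.heappush(heap, (load + cost, gpu_id))
    return sorted((gpu_id, load) for load, gpu_id in heap)

if __name__ == "__main__":
    costs = [120, 30, 300, 45, 90, 210, 60, 150]   # hypothetical per-request work
    for gpu_id, load in assign_least_loaded(costs, num_gpus=4):
        print(f"GPU {gpu_id}: {load:.0f} units of queued work")
```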

Joel Moses: Well, I guess it's [00:20:00] time to talk about the three things that we've learned today. So let me do mine, if I could. First, I've learned that bandwidth needs for AI-driven applications depend on the application that you're running. Whether it's a training or an inferencing workload, and the type of inferencing workload, whether it involves augmented transactions, makes a huge difference in what type of networking you select.

On top of that, though, I also learned that it's about data transfer considerations. Data is king in an AI network, and moving data closer to the point where it's going to be analyzed, or closer to the point where you're going to train with it, is important, because it reduces the amount of time it takes to do either of those transactions.

And time is key to cost in an AI network. And I've learned that data centers of the future are going to need to take into account the specific needs of AI applications and are gonna need to be [00:21:00] architected that way. Ken?

Ken Arora: Yeah, I agree. I'm gonna piggyback off your last observation. I'll say it's more that it's not just AI applications in general: all genAI apps are not the same.

What is your application? Is it latency-sensitive? Is it much more training-intensive? How much do you use RAG? How much do you use agentic? I've learned that these are the questions you need to start by asking yourself, and then from there the decisions follow.

From there, again, I'm the engineer here, so I'll think about, well, what does it mean? What's my ratio of compute to bandwidth? How much of my latency is being taken up by my AI application versus my networking? What is my utilization of my GPUs? And then I make a decision.

I've learned that, and then you make the decisions on what to do. And I guess the third thing, the last thing that really struck me, piggybacking off of Lori's comment of a few minutes ago, is that we need better nomenclature for these things. We are still working our way through it. East-west, [00:22:00] north-south, you know, for a long time we've had a very simple view of this.

There's a lot more. Did you mean east-west as intracluster? Did you mean inside my data center? Did you mean two Kubernetes workloads talking to each other? What did you mean by east-west? If we had more time, I would've asked you these questions.

Joel Moses: Right.

Ken Arora: But it means that we need better nomenclature. We need the vocabulary to talk about this intelligently.

Joel Moses: Agreed.

Lori MacVittie: Yes. And I learned two things. One, I think, right, is the whole application piece, right? Applications are more important to network design, I think, than they ever have been before. We used to classify them in big categories of, well, web apps, and they need this, and, you know, these other apps, and they need that.

But we didn't design networks specifically around the applications, and it sounds like this next move is we have to be a lot more aware of the applications and the resources that they're consuming when we design these networks and how we [00:23:00] implement all the different pieces that go into that.

Which is cool. I mean, everybody, you know, gets an opportunity to design a brand-new network, and we'll see what comes out of that. And second, that Joel is not a competitive gamer, because he thought two milliseconds was not important. So.

Joel Moses: Microseconds, actually. Microseconds.

Lori MacVittie: Even microseconds. It's,

Ken Arora: Lori's got fast reflexes.

Joel Moses: Sure.

Lori MacVittie: I, so fast. Right? More coffee. But, all right, we won't dig into that. That's a wrap for this edition of Pop Goes the Stack. If you dug the beat down, subscribe before disruption disrupts you instead. We'll be here, guarding your SLA, and maybe your sanity.

Creators and Guests

Joel Moses
Host
Distinguished Engineer and VP, Strategic Engineering at F5, Joel has over 30 years of industry experience in the cybersecurity and networking fields. He holds several US patents related to encryption techniques.
Lori MacVittie
Host
Distinguished Engineer and Chief Evangelist at F5, Lori has more than 25 years of industry experience spanning application development, IT architecture, and network and systems operations. She co-authored the CADD profile for ANSI NCITS 320-1998 and is a prolific author with books spanning security, cloud, and enterprise architecture.
Ken Arora
Guest
Ken Arora is a Distinguished Engineer in F5's Office of the CTO, focusing on addressing real-world customer needs across a variety of cybersecurity solution domains, from application to API to network. Some of the technologies Ken champions at F5 are the intelligent ingestion and analysis of data for identification and mitigation of advanced threats, the targeted use of hardware acceleration to deliver solutions at higher efficacy and lower cost, and the design of user experiences based on intent and workflows. Ken is also a thought leader in the evolution of the zero trust mindset for security and how it will be applied to increasingly distributed and even edge-native apps and services. Prior to F5, Mr. Arora co-founded a company that developed a solution for ASIC-accelerated pattern matching, which was then acquired by Cisco, where he was the technical architect for the Cisco ASA product family. In his more distant past, he was also the architect for several Intel microprocessors. His undergraduate degrees are in Astrophysics and Electrical Engineering from Rice University.