

All right, so our first panel today is talking about the challenges and solutions when it comes to scaling AI. Specifically, we talked about the Open Systems for AI initiative. One of the big challenges there is scaling interconnects and memory technologies, so that's going to be the focus for today. I thought that we would kind of look at that topic through the lens of use cases, right? You know, we know that the challenges are different for training compared to inference, and there are different phases of inference. So, hopefully, we can get into some of that in our discussion today, and then maybe touch on some of the challenges of integrations and stuff like that in the minutes ahead. So, maybe, Chris, I'll throw it to you first. Do you want to maybe introduce some of the challenges that come up in scaling interconnects and memory through the training use case?

Sure. So, for training, we've talked a lot about the rate of growth of the models that we're seeing. Often, there's a doubling of the model sizes over the past six months or so, and that trend has been continuing. The other thing that we're seeing is the rate at which models are being released on a monthly basis. Dozens of major models get released monthly, and often those spawn thousands and thousands of derivative models as well. Obviously, that puts a ton of pressure on building this training infrastructure, and that creates a whole host of challenges. Today, we're just going to focus on three to start with. The first one is: how do we continue scaling these clusters as these models continue to scale? Next, how do we improve reliability as these clusters continue to grow? Finally, how do we address some of the power and cooling limitations that come with all of this? As I mentioned, the model size continues to grow, and one of the key issues here is that we need to think about the implications of this on the overall infrastructure design. How do we keep pace with that growth?

Yeah, I think, you know, right now the data center is going through a revolution. I think Omar mentioned that you really have this scaling challenge, and as Chris just mentioned, I think it's really paramount that you have to evolve your infrastructure. At Marvell, we like to call it AI for AI: accelerated infrastructure for artificial intelligence. And really, there are two main pillars for this. If you think about it, right? You've got your Pillar One, which is the accelerated computing, and I think that gets the most hype (GPUs, custom AI accelerators), and then you've got your memory subsystems behind it, HBM with the CPU potentially, and DRAM. Then, you actually have the second pillar, which is clustering all these compute elements together, which we like to call connectivity. It's really the networking tissue that puts these GPUs and AI accelerators together to really scale. And as Omar and Ian just mentioned, it's becoming a big challenge, not only within the data center but between data centers. I think as we move forward, this combination of accelerated computing and connectivity is really the challenge, right, Chris?

Yeah, absolutely. And I think, as a part of that, not only do we need to accelerate, but we need to think about how we build purpose-built solutions for AI specifically. You know, this whole Gen AI effort has really pushed the envelope in terms of how we need to think about hardware design, server design, and even rack design as well. In the past, we've been stuck with some kind of fixed ratios when it comes to server design, like the GPU-to-CPU-to-NIC ratios. These have been fixed, but we need to think about how to enable better flexibility going forward. We need more modularity; we need to evolve these ratios to continue scaling these clusters. We also need to think about switches differently. In many cases, we've had to use general-purpose switches, but now I think it's at the point where we need to start developing purpose-built AI switches that will help us continue scaling these fabrics. Additionally, we need to think about the fabric itself. For the scale of fabrics that Ian, for example, talked about, we're at a rack-level fabric now, and it's easily spanning into multiple racks. Of course, we can't forget about memory in this context either, especially in the training context. There are certainly a lot of memory bottlenecks we need to consider. If you look at how the core base memory component—the DRAM itself—has been scaling over the past few years, you can see the density increase going at a rate of doubling every few years. However, with model sizes doubling in a matter of months, we need to think about how to continue scaling memory capacity and overall memory footprints. This goes beyond just adding more HBM to these GPUs; it's about storing models beyond that. So, we need to start thinking from a hierarchical perspective as well.
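As a rough illustration of the gap described above, the sketch below compares two doubling rates over the same horizon. The specific periods (6 months for model size, 36 months for DRAM density) are assumptions chosen to match the loose figures quoted by the panel ("six months or so", "every few years"); they are not measured data.

```python
def growth_factor(doubling_period_months: float, horizon_months: float) -> float:
    """Multiplicative growth over a horizon, given a fixed doubling period."""
    return 2.0 ** (horizon_months / doubling_period_months)

# Assumed, illustrative doubling periods (not measured figures).
MODEL_DOUBLING_MONTHS = 6.0   # "doubling ... over six months or so"
DRAM_DOUBLING_MONTHS = 36.0   # "doubling every few years", taken here as 3 years

horizon = 36.0  # look three years ahead

model_growth = growth_factor(MODEL_DOUBLING_MONTHS, horizon)
dram_growth = growth_factor(DRAM_DOUBLING_MONTHS, horizon)

# How much faster demand grows than per-device capacity over the horizon.
gap = model_growth / dram_growth

print(f"model size grows {model_growth:.0f}x, DRAM density grows {dram_growth:.0f}x")
print(f"capacity gap to cover with more devices or more tiers: {gap:.0f}x")
```

Under these assumed rates the shortfall compounds quickly, which is one way to read the argument for a deeper memory hierarchy rather than simply more HBM per GPU.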

Yeah, so I completely agree with Chris's point regarding the need for a flexible and reliable AI solution for processing massive data. To build a scalable system for AI training, I believe that resource disaggregation might be a good approach, and domain-specific accelerators can play a significant role in enhancing performance in this context. For example, approaches like in-switch computation can significantly reduce collective communication overhead, and processing near-memory or in-storage processing can minimize data movement. By the way, building and standardizing this infrastructure is crucial to leverage resource disaggregation and domain-specific accelerators. This initiative will require substantial industry communication and contribution from various countries' industries for widespread adoption. I believe that OCP is playing a critical role in advancing this ecosystem by defining key building blocks, spanning electrical or optical interconnects, switching, the operating system, and the reference board design for all compute elements. Anyway, this is a systematic challenge. The reality is that there is no one-size-fits-all solution. The OCP-based composable building blocks, which are optimized and customized, are very critical. Although we can resolve some of the issues, there still remain challenges in distributing large datasets across the disaggregated resource pool and subsequently aggregating the results. This process involves different parts of the memory and interconnect hierarchies.

So, hierarchies are actually a topic that's frequently talked about with respect to memory. And interconnect is obviously interwoven into that. When you talk about, for example, doing more than just adding more HBM and disaggregation, you're from Samsung. Maybe tell us a little bit about the challenge of understanding memory technology through the hierarchy, because I know there are a bunch of engineers here from different backgrounds. Can you give us the 101 on that?

Sure. As you know, computation is fundamentally the process of digesting data. So, as your computational capability increases, the memory bandwidth and capacity should also grow to feed data in time. But the problem is that it is very challenging to keep pace with the scaling speed of logic processes and the architectural innovations for the computing unit, like a tensor core or 3D packaging. As a result, the traditional memory hierarchy pyramid of SRAM, DRAM, and storage needs to be made deeper and more complex. New memory hierarchies are emerging, including HBM (high bandwidth memory), positioned between SRAM and DRAM, and CXL memory, which bridges main memory and storage. HBM can address the limited capacity and bandwidth issues of both SRAM and DRAM. Unfortunately, HBM capacity scaling is much more difficult than HBM bandwidth scaling, especially with innovative approaches like bufferless HBM, which can be stacked directly on top of a compute die. A disaggregated memory pool offering ultra-high bandwidth is a promising candidate for bridging this gap. Technologies like CXL, UALink, or NVLink are strong candidates to efficiently handle data. Links offering speeds up to 400 gigabits per second can be a viable solution.

Makes sense.

So, when we think about the memory hierarchy and the fact that we can't keep all of the memory close to the computing elements, essentially what that means is we have to distribute it across, right? So, that ultimately means we need interconnects to be able to do that. And it's actually interesting to think about interconnects as a hierarchy of sorts as well. So, we tend to think about it in terms of scale-up and scale-out. Scale-up in this context essentially means that if you take a whole group of compute elements, GPUs, accelerators, and similar devices, and you aggregate them into what looks like one very large GPU or one very large accelerator, that's traditionally what we mean by scale-up. Now, if you want to take this and distribute it horizontally across many more systems, then we're talking about scaling out. The interconnects that TaekSang was just talking about are often focused on the scale-up piece of the equation. To be able to attach memory into this, we can think about how CXL could potentially help us. As you probably know, most GPUs have two high-speed interconnects, one on each side of the chip, right? There's a front side, which allows you to connect the GPU to the CPU itself, and then there's the back side, which connects the GPUs to each other. Interestingly, this gives us a couple of potential attachment points for different types of memory tiers. Let's take CXL as an example. Even if the GPU itself doesn't necessarily speak CXL, the CPU does. Since the GPU talks to the CPU anyway, the GPU can communicate with CXL memory attached off of that CPU in the same way it would talk to natively attached memory on that CPU. This gives us some flexibility in terms of where we can attach these new memory hierarchies.

Yeah. And as you look to scale these larger and larger models, one must use parallelism, right? It's impossible to fit a whole model from a training perspective, or the parameters, into one HBM memory set. So, as you start to scale these models, you really start to move to approaches like the database went through, where you're sharding the data across multiple elements and really distributing the computation across all the different elements: the compute elements. You have your model parameters, your optimizer states, your gradients, and so on. So, many different types of parallelism are required going forward, right? And just a couple of months ago, I think in August, Meta actually released their Llama 3 paper on this, and they talked about all the different parallelism methods they used, right? Spanning pipeline parallelism, tensor parallelism, and obviously, data parallelism. And I think as you look at this... Yeah. I think that's a big problem. When you have hundreds of thousands of GPUs, going to millions of GPUs and AI accelerators, you start to get tens of millions of interconnects. And when you start to introduce all those different elements and all those different interconnects, there's going to be more potential failure points. And this really drives the point that reliability is going to become more and more critical. And I think that's a key thing that we've got to work on, Chris.
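To make the "it won't fit in one HBM" point concrete, here is a back-of-the-envelope sketch of training state size. The 16 bytes per parameter figure is an assumption based on the commonly cited mixed-precision Adam accounting (2 bytes of 16-bit weights, 2 bytes of 16-bit gradients, 12 bytes of 32-bit optimizer state); the 70-billion-parameter model and the 80 GB of HBM per device are hypothetical inputs, and activations are ignored entirely.

```python
# Assumed per-parameter costs for mixed-precision training with Adam.
BYTES_WEIGHTS = 2     # 16-bit model parameters
BYTES_GRADS = 2       # 16-bit gradients
BYTES_OPTIMIZER = 12  # 32-bit master weights, momentum, and variance

GB = 10**9


def training_state_bytes(num_params: int) -> int:
    """Bytes needed for parameters, gradients, and optimizer states."""
    return num_params * (BYTES_WEIGHTS + BYTES_GRADS + BYTES_OPTIMIZER)


def min_devices(num_params: int, hbm_bytes_per_device: int) -> int:
    """Smallest number of devices whose combined HBM holds the training state,
    assuming the state is sharded evenly and nothing else uses the memory."""
    total = training_state_bytes(num_params)
    return -(-total // hbm_bytes_per_device)  # integer ceiling division


# Hypothetical example: 70B parameters, 80 GB of HBM per device.
params = 70 * 10**9
hbm = 80 * GB

print(f"training state: {training_state_bytes(params) / GB:.0f} GB")
print(f"minimum devices just to hold it: {min_devices(params, hbm)}")
```

Real deployments need far more devices than this lower bound, since activations, communication buffers, and throughput targets all add to the requirement; the sketch only shows why sharding is unavoidable.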

Yeah, absolutely. I mean, the whole objective here is, as you're running a training job, for example, you need that job to complete in a reasonable timeframe, and you need to do so with efficient utilization of the underlying hardware as well. If you don't have strong reliability built into the components, into the interconnects, and so forth, it makes that very difficult, and ultimately you'll end up underutilizing some of those extremely expensive GPUs, right? So, that's not really a scenario you want to end up in. Yeah. So, the key here is we need to think carefully, and we need to build intelligently with the right RAS (reliability, availability, serviceability) mechanisms and the right level of observability in the hardware and the interconnects themselves so that we can improve reliability as we continue to get to these tens of millions of interconnects. And, as a part of that, that telemetry needs to be built into the retimers, the cables, the switches, and so forth. That's something that our Cosmos suite of software, which we run on our products, gives you the capability to do. And then, of course, we can't forget memory, right? Anytime you attach memory, you need reliable and strong error correction capabilities on that memory as well. And that's, of course, something you have to think about when you use CXL in this context as well.
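The reliability concern at "tens of millions of interconnects" can be sketched with a simple probability model. This assumes every link fails independently with the same probability during a job, which is a deliberate simplification; the per-link failure probability used below is a made-up number for illustration only.

```python
import math


def prob_any_link_fails(num_links: int, p_link: float) -> float:
    """Probability that at least one of num_links fails during a job,
    assuming independent, identical per-link failure probability p_link.

    Computes 1 - (1 - p)^n using log1p/expm1 to stay accurate for tiny p.
    """
    return -math.expm1(num_links * math.log1p(-p_link))


# Hypothetical numbers: ten million links, one-in-ten-million chance each.
links = 10_000_000
p = 1e-7

print(f"chance of at least one link failure: {prob_any_link_fails(links, p):.1%}")
```

Even with each individual link being extremely reliable, the chance that something fails somewhere is large at this scale, which is the motivation for telemetry and RAS features in every component on the path.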

Yeah. The final thing that I would like to cover is the data center cooling system. The power consumption of AI infrastructure is well known, but the energy demand of the cooling system is surprisingly high. PUE (Power Usage Effectiveness) is the key metric here: it is the ratio of the total facility power, including cooling and other infrastructure, to the power consumed by the IT equipment. With a traditional air cooling system, the average PUE is around 1.7. But with direct liquid cooling, the PUE improves to about 1.2. And with immersion cooling, it can drop further, into the low 1.x range. This improvement suggests a potential cost saving of approximately 30% compared to traditional electricity expenses. So optimizing the cooling system is very, very critical to enable an energy-efficient data center structure.
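The roughly 30% figure can be checked with a short calculation. PUE is total facility power divided by IT equipment power, so for a fixed IT load the total power scales directly with PUE. The 1.7 and 1.2 values are the ones quoted by the speaker; the 1 MW IT load is an arbitrary assumption that cancels out of the percentage.

```python
def facility_power(it_power_kw: float, pue: float) -> float:
    """Total facility power for a given IT load, from PUE = total / IT."""
    return it_power_kw * pue


def relative_saving(pue_before: float, pue_after: float) -> float:
    """Fractional reduction in total facility power at a fixed IT load."""
    return 1.0 - pue_after / pue_before


IT_LOAD_KW = 1000.0  # assumed 1 MW of IT equipment, for illustration

air = facility_power(IT_LOAD_KW, 1.7)     # traditional air cooling
liquid = facility_power(IT_LOAD_KW, 1.2)  # direct liquid cooling
saving = relative_saving(1.7, 1.2)

print(f"air: {air:.0f} kW, liquid: {liquid:.0f} kW, saving: {saving:.1%}")
```

This treats electricity cost as proportional to total power drawn and ignores capital costs of the cooling system itself.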

Okay. I think we hit all of the challenges that Chris laid out, right? Scaling AI clusters as models scale, improving reliability as we scale, and touching on power and cooling, which is, I know, another topic that we can get into. But that gives a good breadth of some of the challenges addressed with the training use case. We talked about memory capacity and interconnect bandwidth as themes, and lots of other stuff there. But as we think about the inference and model serving use case, I've heard, obviously, that latency is another big performance requirement out of it. So with that, maybe I'll throw it to you, Nigel, to talk a little bit about the challenges with inference.

Yeah. I think the whole user experience starts with inference, right? So, no one wants to wait for the response. So, I think latency is paramount, and it's a critical problem that we've got to solve. When you look at inference, it's a memory-bound challenge, meaning there's not enough memory capacity per AI accelerator or GPU to actually support the model sizes that are... the parameters that are rising right now, right? So, I think there are many ways to fit more memory, more parameters into the memory. There's pruning, and there's actually coming up with smaller model sizes. But going forward, there are going to be some innovations as an industry. As an example, you can quantize these types of models. One thing I'm really super excited about is that at OCP, they came up with the Open Floating Point 8 to help the industry move forward, right? And I think as you continue to see these types of collaboration as an ecosystem, it's paramount to advance the whole architecture going forward. And I think, as we look forward, latency is a bigger issue on the inference side than on the training side, right? Because training is done behind the scenes. The user doesn't really get to see it. Obviously, all the hyperscalers and the people paying for those model trainings are worried about that and want to get their models done as quickly as possible. But from a user perspective, you don't see all that, right? We only see this on the inference side. And I think, as you start to see new models, like the OpenAI o1 or the Strawberry model that just came out, which is more multi-step, the infrastructure must evolve. Wouldn't you say so, Chris?
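For readers unfamiliar with quantization, the sketch below shows the general idea of storing values in 8 bits with a shared scale factor. It uses plain symmetric integer quantization purely as an illustration; it is not the 8-bit floating point format the speaker refers to, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.1, -1.0, 0.3]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each value now needs 1 byte instead of 4 (for 32-bit floats), at the cost
# of a rounding error of at most half a quantization step per value.
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.6f}", f"worst error={worst_error:.6f}")
```

The trade is capacity for precision: the same accelerator memory holds more parameters, and the open question for any given model is how much accuracy the rounding costs.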

Yeah, absolutely. In fact, inference is kind of an interesting problem because there's this tension point between not only being sufficiently responsive—in other words, to your point about latency—but there's a large amount of memory capacity and memory intensity here as well. So let's take a concrete example. Let's take a look at some of the GPT-like apps, for example, that Omar was showing earlier. When you're interacting with those, you, as an audience, have the expectation that there's a real-time response to this. You also have the expectation that the response you get back is both insightful and sufficiently accurate. What happens behind the scenes, though, is that to provide that sufficiently accurate and insightful response, it requires quite a lot of data—not just the model itself, but a lot of other data to support the model. So when we talk about this, we refer to things like context, and the context is what helps us provide that sufficiently detailed and accurate response. But the challenge, of course, is that you have all this data, but you also need to respond very quickly. Today, that memory capacity often exceeds what you can attach directly to a CPU in terms of memory. A lot of times, that ends up spilling into SSDs, for example. So that's a big challenge. SSDs represent another memory tier, but of course, SSDs are slower. That then potentially impacts your latency again. So we have this challenge where technologies like CXL memory could start to play a role in helping us balance both the performance and capacity aspects.

Makes sense. Do you want to weigh in on this, too, TaekSang?

Yeah, actually, we started with the size issue. Memory capacity will still be a big issue due to the variety of model types and data formats. Retrieval-augmented generation, or RAG for short, introduces memory capacity issues, as all the context needs to be saved in memory in an indexed vector format. Considering the difference in size between the vector database and the memory capacity, I believe that CXL memory with a processing-near-memory engine might be a good solution to bridge this gap. For similarity search operations, like a k-nearest neighbor search in a vector database, the processing-near-memory engine can read large amounts of data from locally attached memory, check the nearest neighbor, and return only a few indexes to the accelerator. This can reduce CXL bandwidth and power consumption by several orders of magnitude. In addition, what I want to really emphasize is the CXL memory pool access latency. One of the industry's concerns about CXL memory pools is their longer latency compared to 1-hop NUMA, which can stall CPU pipelines. However, AI memory for GPUs is less sensitive to latency because it allows bulk data prefetching. Therefore, all memory connected to the GPU pool through CXL, UALink, and NVLink is very suitable for transferring bulk data without concerns about latency issues. This can alleviate the industry's concerns about current memory pool access latency, in my opinion.
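The bandwidth argument for processing near memory can be sketched as follows. The search itself is an ordinary brute-force k-nearest-neighbor scan; the point is where it runs. If the scan runs next to the memory, only the query goes in and only k indexes come back over the link. The vector count, dimension, and byte sizes below are assumptions for illustration, and the function names are hypothetical.

```python
import heapq


def knn_indices(vectors, query, k):
    """Brute-force k-nearest-neighbor search, returning indexes only.
    In the scenario described, this scan would run near the memory."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, query))
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: sq_dist(vectors[i]))


def bytes_over_link(num_vectors, dim, k, bytes_per_value=4, bytes_per_index=4):
    """Compare link traffic: shipping the whole database to the accelerator
    versus sending one query and receiving k indexes back."""
    scan_on_accelerator = num_vectors * dim * bytes_per_value
    scan_near_memory = dim * bytes_per_value + k * bytes_per_index
    return scan_on_accelerator, scan_near_memory


# Tiny functional example.
db = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
print(knn_indices(db, [1.0, 1.0], k=2))

# Assumed scale: one million 768-dimensional float32 vectors, top 10 results.
far, near = bytes_over_link(1_000_000, 768, 10)
print(f"accelerator-side scan: {far} bytes, near-memory scan: {near} bytes")
```

With these assumed sizes the traffic differs by roughly six orders of magnitude, consistent with the "several orders of magnitude" claim, though real systems use approximate indexes rather than a full scan.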

OK, thank you. There's a lot of great technical detail here, which I love about the OCP community, that we're able to touch on in a short span of time. Understanding that although there's a drive for more compute, there are use cases, like inference, where the compute is actually waiting on memory. And the things that we are starting to think about to solve that. In wrapping up, why don't I go through the panel one by one, where maybe you guys can say a few words about your organizations, what's happening at Summit, where people can find you, and specifically what you are doing to tackle the challenge of scaling AI, memory, and interconnects.

Yeah, so why don't I start? So really, I think at a high level, we're really in the first inning of this whole revolution, right? It's so early in this new era, which we like to call the learning era, driven by AI. This era is going to be larger than all the others we've had—be it the PC compute era, the internet connect era, the mobile move era, or the last cloud serve era. If you put all those together, AI is going to be much larger. So we're super excited at Marvell to be enabling this accelerated infrastructure to scale AI and working with the community. There are multiple areas we're working on. We didn't get into chiplets here either; that's another key area for this whole memory stacking and memory hierarchy. So I think we have multiple presentations going on this week. We have a nice booth that's showcasing a lot of our technologies in action, so I encourage you all to come check it out.

Cool. How about maybe throwing it to you, TaekSang?

Yeah, Samsung is at the forefront of innovation to empower AI with next-generation Samsung memory solutions. All our innovative products for high bandwidth and high capacity memory solutions, including CXL, near-data processing, HBM, GDDR, and all storage solutions, will be showcased at our exhibition booth. They will also be shared through the expo hall, executive sessions, and breakout session presentations. Samsung is committed to delivering all innovative solutions through industry partnerships by building the OCP ecosystem.

Thanks. And Chris, the last word.

Sure. So, I mean, as we're rethinking this hardware design in this complex and scaling AI era, we really need to be thinking about how we can move faster. How do we design with software in mind? How do we scale farther? All of that while we're trying to push the envelope here in terms of power, bandwidth, latency, efficiency, reliability, and security. Right? So there's a lot there, and it really requires this whole OCP ecosystem for us to work together and make some of this happen. At Astera Labs, we're really into this purpose-built and software-defined approach, and it's really part of our DNA for the AI era. You can see that implemented across our retimers, our cables, our CXL controllers, and really in our newest portfolio here with our switching products as well. So, we invite you all to please come visit our booth, B13, see some of our demos in the Innovation Village, and then see the 10 or so presentations that we've got going on throughout the summit. So, thank you.

Brilliant panel. Give these guys a round of applause. Thank you.