Hi, my name is Torry Steed. I'm a Senior Product Marketing Manager at Smart Modular Technologies, now also known as Penguin Solutions. Today, we're going to talk about a CXL market perspective and how you can get ready, because CXL isn't just coming; it's here, and it's growing.
So, a little bit of background first. There have been some major technology shifts, paradigm shifts if you will, throughout the ages. We started in the 80s with PCs, moved on to the internet in the 90s, mobile in the 2000s, and cloud in the 2010s. And now we're in this AI and machine learning era, which is a major technology shift. AI and machine learning are pushing technology forward, on both the hardware and the software level, all across the industry. I do want to mention two other categories of applications in a similar ballpark that play into that AI and machine learning space as well: high-performance compute and in-memory databases. All three of these categories of applications are really driving the industry forward, especially when we talk about memory, because all of them have a high demand for memory performance and memory capacity.
So, with that high demand, what happens? We hit what's called the memory wall. Many of you may be familiar with this concept: it's the idea that the performance of processors has been outpacing the performance of memory for many years now, and as we continue to develop processors with higher and higher core counts, that problem just continues to get worse. A lot of times, people think of this gap between processor performance and memory performance in terms of speed and latency, but it's also very important to consider memory capacity, because we're becoming more and more limited in how much memory capacity we can fit into a server. That contributes to the memory wall as well, especially for those AI applications, those really memory-intensive applications.
So, what do we do? Traditionally, the way you scale memory is with memory DIMMs; we've had those for decades now, and you just put more DIMMs in your server. You attach more. Sometimes you have a dual-socket server; you can even have a quad-socket server, and you can fit more and more memory DIMMs. Or, to a point, you can scale up the amount of memory on the DIMMs themselves. For example, in DDR5 today, using 3DS technology, we can get up to maybe 256 gigabytes per DIMM; 512 is probably coming. So, you can scale that way. But once you've filled your server, the only way to continue scaling memory is to add additional servers, even if you don't necessarily need that compute. Maybe your processing is happening on GPUs, or maybe compute isn't the limiting factor at all; the limiting factor is really that memory limitation. Another way to solve that problem is with a relatively new technology, CXL. You've been hearing about CXL for a while now, but we're going to get into some more of the specifics. CXL is a way to add additional memory capacity and capability to a server, either directly in the system or through a memory pooling appliance; we'll get into the details of how that works coming up. So, CXL is an emerging technology, and it comes with several different form factors of memory modules that can be used in a server or in that memory appliance. We'll talk about those options in more detail coming up.
So, just real quick, what is CXL? CXL is a set of protocols: CXL.io, .cache, and .mem, all of which operate on the existing PCIe hardware interface. Depending on how you mix and match those protocols, you can create several types of devices: Type 1, Type 2, and Type 3. Not very creatively named, but that's what they are. Type 3 is what's being introduced into the industry first. Type 3 devices are the memory expansion devices; they're all about adding additional memory capacity and memory bandwidth to a server or to an appliance. Soon, and people are already working on it, we'll also have accelerators connected via CXL, things like smart network interface cards. Those are coming too. But right now, there's a lot of emphasis on Type 3 memory expansion, which is what we'll be focusing on.
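As a quick mental model, the protocol mix behind each device type can be sketched in a few lines of Python. The protocol sets follow the CXL specification's definitions; the helper function and dictionary names are just for illustration:

```python
# Sketch: CXL device types as combinations of the three protocols.
# Protocol sets follow the CXL spec; the helper function is our own.

CXL_DEVICE_TYPES = {
    "Type 1": {"protocols": {"CXL.io", "CXL.cache"},
               "example": "accelerator (e.g. SmartNIC) that caches host memory"},
    "Type 2": {"protocols": {"CXL.io", "CXL.cache", "CXL.mem"},
               "example": "accelerator with its own device memory"},
    "Type 3": {"protocols": {"CXL.io", "CXL.mem"},
               "example": "memory expansion module -- the focus here"},
}

def classify(protocols):
    """Return the CXL device type matching a set of protocols, or None."""
    for dev_type, info in CXL_DEVICE_TYPES.items():
        if info["protocols"] == set(protocols):
            return dev_type
    return None

print(classify({"CXL.io", "CXL.mem"}))  # -> Type 3
```

Note that every device type includes CXL.io, which carries discovery and configuration; the .cache and .mem mix is what distinguishes the types.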
OK, so what are the benefits of CXL? The number one benefit is increased capacity. As I talked about, once you fill up the DIMM sockets in your server, that's it; that's as much memory as you can attach to that CPU. With the addition of CXL, you effectively get additional DIMM sockets that you can populate for that CPU, so the memory scales significantly. You can also reduce memory costs with CXL. Instead of paying a premium for very high-density DIMMs, specifically 3DS DIMMs with stacked DRAM die, which cost more to manufacture and are in limited supply, you can buy more mainstream modules and use CXL to get those additional DIMM sockets. That keeps your overall cost of memory on a server down, which is pretty significant. Finally, once you've maxed out the memory channels and memory controllers in a CPU accessing DIMMs the traditional way, you can actually increase performance by adding memory on the PCIe interface via CXL as well. It gives you an additional path to memory for very high-load applications where all the cores are queued up waiting on the memory controllers, and some applications can really benefit just from that additional bandwidth.
One caveat: because we're adding a CXL controller into the picture, CXL creates a new tier of memory. A lot of people show the picture on the left, the memory pyramid, and you can easily see where CXL fits into that stack. But I've drawn it on the right-hand side on a logarithmic scale, and what I want to point out is that even though CXL does add some latency above traditionally attached main memory in a DIMM, it is orders of magnitude faster than flash. So one of the performance benefits you can get is this: if you have enough memory in the server to avoid going to swap or paging, you can get dramatic performance improvements. I'll show some details on that coming up.
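To make the logarithmic-scale point concrete, here's a small sketch comparing rough round-trip latencies. The numbers are illustrative round figures in the spirit of the chart, not measurements of any particular part:

```python
import math

# Approximate round-trip latencies in nanoseconds (illustrative round
# numbers, not benchmarks of any specific device).
tiers_ns = {
    "direct-attached DRAM": 100,
    "CXL-attached DRAM":    200,
    "NVMe flash (paging)":  50_000,
}

for name, ns in tiers_ns.items():
    # Show each tier with its order of magnitude, as on a log scale.
    print(f"{name:22s} ~{ns:>7,} ns  (10^{math.log10(ns):.1f})")

# CXL costs roughly 2x DRAM latency, but avoiding a page-in from flash
# saves two to three orders of magnitude.
ratio = tiers_ns["NVMe flash (paging)"] / tiers_ns["CXL-attached DRAM"]
print(f"flash is ~{ratio:.0f}x slower than CXL-attached DRAM")
```

That ~250x gap is why avoiding swap with extra CXL capacity can dominate the ~2x latency penalty versus direct-attached DIMMs.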
Okay, CXL is rolling out in generational leaps. The first generation, the CXL 1.1 spec, allows for a memory module attached directly to the CPU, as we've talked about. CXL 2.0 enables one level of switching, so now you can build an appliance in a rack that talks to multiple servers. That's the first instantiation of a memory pool. It's not necessarily dynamic, so you have to assign memory to the different servers, but it allows you to share memory across servers within the rack. And then what's coming soon is CXL 3.0, a fully disaggregated memory solution. This is the one people are really excited about. It avoids problems like stranding memory in a server and allows you to dynamically allocate these pools of memory on a job-by-job basis, which has a lot of potential. I'm excited about that, too. But I really want to stress that the first two methods of adopting CXL offer a lot of advantages, and they're available right now, today. That's pretty exciting also. And I'm going to show you some real-world examples of those use cases right now.
So, here are a couple of examples of those two. The one on the left shows servers that support CXL memory expansion today. The one on top supports the E3.S form factor; if you're not familiar with E3.S, it's essentially the updated replacement for U.2. These modules are designed to go in the front of the server, and you can install quite a few of them across the server if it's configured for that. We also have a lot of servers in the industry today that are focused on supporting AI and designed to hold GPUs in a PCIe CEM, or add-in card, form factor. One of the nice things about CXL is that you can get add-in cards in that same form factor. So now you can mix and match GPUs and CXL memory in a server and balance: do I need more GPU processing, or do I need more main memory? You can trade off between the two. Or you can even build a server with tons of CXL memory in it if you don't necessarily need GPU processing for training; maybe you need to hold a massive database in a single node so that you can run inferencing with that database behind your model. So there are some neat tradeoffs you can play with, and that's the simplest way to adopt CXL today: you can buy servers and CXL memory expansion modules today that will work together. The one on the right is the CXL 2.0 switching model. This is not yet fully disaggregated, with one memory appliance serving the entire data center or multiple racks. But multiple companies are developing appliances with a CXL switch that can service maybe two, four, six, or eight servers in the rack.
So, you can think of that as a memory pool that services the servers within that local rack via a CXL interface. That's available today in some cases, and coming very soon in others.
Why has CXL taken so long? You've probably been hearing about CXL for years now, and it's taken a long time to get to where we are today. Part of that is because the spec was defined years ago; the original spec came out, I think, in 2019. But we needed the spec to be in place. We needed CXL controllers, and as you know, ASIC development takes a significant amount of time. We needed the CPUs to support CXL. And we needed the rest of the infrastructure, BIOSes and operating systems, all to support CXL. All of that had to come together, and the industry has been working on it for a long time. But we're finally at a point here in 2025 where all of those pieces are coming together, and we can move beyond the proof-of-concept stage into ramping, mass production, and actually getting good use out of this.
So, here's some of the pieces that had to come together. We've got AMD and Intel processors that are supporting CXL, BIOS that's supporting CXL, as well as operating systems that either support it today with the latest versions or are in the process of implementing CXL. So, it's finally coming together.
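If you want to check whether your own Linux system has enumerated any CXL devices, the kernel's CXL driver exposes them through sysfs. Here's a minimal sketch; the `/sys/bus/cxl/devices` path comes from the upstream kernel's CXL driver, and on a machine without CXL hardware or driver support the list is simply empty:

```python
import os

# Path exposed by the Linux kernel CXL driver (absent on kernels
# built without CXL support).
CXL_SYSFS = "/sys/bus/cxl/devices"

def list_cxl_devices():
    """Return enumerated CXL device names, or [] if none are present."""
    try:
        return sorted(os.listdir(CXL_SYSFS))
    except FileNotFoundError:
        return []  # no CXL driver support in this kernel

devices = list_cxl_devices()
print(devices if devices else "no CXL devices enumerated")
```

For deeper inspection, the `cxl` command-line tool from the ndctl project queries the same subsystem.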
Okay, so what does that all boil down to in terms of a market perspective? This is from Objective Analysis, and it highlights what I was talking about earlier. 2024 was the proof-of-concept stage, with CXL still ramping; a lot of people were getting samples, trying experiments, testing, and seeing how it performed. Now we're leaving that stage and entering the production stage of CXL, with production systems available off the shelf. This particular analysis is pretty realistic, and it estimates that eventually up to 30% of servers will have CXL attached in some way, whether directly in the server or through a CXL memory appliance of some kind. So a significant portion of memory is going to be using CXL in the future.
Okay, at this point, we're going to transition a little bit. We're going to talk about some of the products that are available today and specific use cases of those products. And then, we'll talk about some performance after that.
So, this is just a high-level view of some of the products that Smart is offering. And we have a couple of add-in cards that are both DDR5 add-in cards. And we have a couple of E3 modules as well.
So, we'll talk about each one of them a little bit, and then we'll move on to some use cases and some performance. The first card is a DDR5 add-in card that supports four DIMMs. Each of those DIMMs can be up to 128 gigabytes, so if you fully populate this card, it adds half a terabyte to the system. It's a full-height, half-length, single-width card, so it fits well in a 1U or 2U server where add-in cards are installed horizontally via a riser card. It has a full x16 interface to a single CXL controller, and it's in early production now from Smart, so it's available.
We also have an 8-DIMM add-in card. This is a full-height, half-length, dual-width card, that double-width form factor very similar to a lot of the GPUs available today. It exceeds the 75-watt maximum you can draw through the slot's gold-finger connector, so it has an auxiliary power connector, much like many GPUs today. To support eight DIMMs, we have to use two CXL controllers today, so while the card plugs into a single slot, it logically requires bifurcation of the CXL bus. That means you really want to look at the current generation of CPUs, like Granite Rapids and Turin, which support that bifurcation. This card also supports up to 128-gigabyte DIMMs, so altogether it's one terabyte of memory. Adding one of these cards into a server adds a terabyte of memory, which is pretty awesome.
Okay. The next one I'm going to talk about is an E3.S form factor module. There are actually several memory companies out there making modules like this. This one supports up to 128 gigabytes of memory, uses a single CXL controller with an x8 interface, and is sampling now, with production coming soon.
And finally, I want to talk about an interesting approach. Optane was around for a while to support persistent memory, and Smart Modular and others make different types of NVDIMM, that is, non-volatile DIMM, technology. But that's not proceeding forward into DDR5; the system vendors have decided it's not something they want to add the hooks in for on the DDR5 timeline. So instead, what we're looking at for non-volatile memory is using CXL, and there are some real advantages to doing that. This is our first-generation non-volatile module: 32 gigabytes of DDR4, because it's intended to replace one of those non-volatile DIMMs, and that's a typical density for them. The host connects to this module and treats it just like a CXL memory expansion device: hey, great, I've got an extra 32 gigabytes. It's full speed, but it shows up as persistent memory. If the system has a critical fault or a power failure, the controller on board takes over the DRAM and copies all of the data onto NAND flash using on-board power. A lot of times, NVDIMMs had a cable that ran over to another power supply; that's not necessary here. This is all self-contained: it powers itself long enough to copy all the data into NAND flash, and when the system comes back up, it copies the data back into DRAM, and you're off and running as if you'd never lost it. This is really good for things like checkpointing, or storing reference tables or change logs for a database in non-volatile memory, because it's essentially full-speed memory that's just persistent.
Okay. We've now covered a couple of add-in cards and a couple of E3 form factor CXL memory expansion modules, so I want to talk about some of the things to consider when you're looking at the two options. One thing to keep in mind is that E3 is a smaller form factor, which is good; you can potentially fit more of them in a system, especially a smaller system. But because of that smaller form factor, you're more limited on capacity. Capacity is typically going to be 128 to 256 gigabytes in that form factor, which, assuming an x8 interface, translates to 32 gigabytes per lane of CXL: eight lanes, 256 gigabytes, so 32 gigabytes per lane. Compare that to an add-in card, which, because it's larger and uses off-the-shelf DIMMs at 128 gigabytes per DIMM, can hold a lot more capacity. These are typically x16 interfaces. So the 8-DIMM add-in card, which today supports a terabyte of memory, should soon be able to support two terabytes, and two terabytes divided by 16 lanes gets us to 128 gigabytes per lane. One of the limiting factors when you're designing a system is how many PCIe lanes you have: you need some for storage, you need some for networking, and those PCIe lanes are pretty precious. So maximizing your memory capacity per lane might be something you want to consider, and in that case, the add-in cards offer a more efficient use of those lanes. Cost per gigabyte matters too. Because you need a CXL controller and some other components to build these modules, there's some overhead compared to a standard DIMM, and since you can get a lot more capacity on the add-in cards, their cost per gigabyte is actually a bit lower than that of the E3s.
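The capacity-per-lane comparison boils down to simple division; here's that arithmetic as a sketch, using the capacities quoted above:

```python
# Capacity-per-lane arithmetic for the two CXL form factors discussed.

def gb_per_lane(capacity_gb, lanes):
    """Memory capacity delivered per PCIe/CXL lane consumed."""
    return capacity_gb / lanes

# E3.S module: up to 256 GB behind an x8 link
e3s = gb_per_lane(256, 8)     # -> 32.0 GB per lane

# 8-DIMM add-in card: 2 TB (with future 256 GB DIMMs) behind an x16 link
aic = gb_per_lane(2048, 16)   # -> 128.0 GB per lane

print(f"E3.S:        {e3s:.0f} GB/lane")
print(f"Add-in card: {aic:.0f} GB/lane ({aic / e3s:.0f}x denser per lane)")
```

So when PCIe lanes are the scarce resource, the add-in card delivers roughly four times the capacity per lane.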
On power, there are different power budgets simply because of where these devices are installed in the system. The other thing to keep in mind is that E3 definitely has an advantage on serviceability and access, because those modules install in the front of a server. They're much easier to service or access, if that's critical to your use case, whereas the add-in cards are typically installed inside the server, so you have to open the server up to gain access. Just some things to keep in mind when you're looking at the tradeoffs between the different CXL form factors that are emerging.
Here's an overview of some of the applications that really benefit. These are the applications hitting up against that memory wall we talked about earlier. The obvious one is in-memory databases. This is pretty simple: if you can keep the whole database in memory, whether that's direct-attached memory, CXL-expanded memory, or a combination of the two, you're going to have significantly better performance than if you have to swap or go to IO. Some examples include SAP HANA and Redis. So that's definitely an application category that can benefit from CXL. High-performance computing, things like climate modeling and biotech, can benefit too. It's not AI in the sense of large language models, but those simulations are using larger and larger datasets, so CXL can help there as well. Financial technology is very similar. And then, obviously, AI and machine learning, which we're talking about today, can really benefit from additional memory: memory performance, memory capacity, anything you can get. Yes, you have GPUs that are training or running your model, but you've got to feed them. A lot of times you have a very large database feeding those AI models, and that's where CXL can really keep your costs down and allow you to support larger databases than you might otherwise be able to.
Okay, talked a lot about theory. We talked some about what's available today. Let's show some examples of what people are doing with CXL and what is really available and how it works.
Okay, let's talk a little bit about latency again. We talked earlier about direct-attached memory, and here I'm diving a little deeper into where the latency comes from. Obviously, you've got a memory controller in your CPU that's talking to DRAM; some of the latency comes from the memory controller, and some of it from the memory itself. You typically see around 100 nanoseconds of round-trip latency: writing data out to the DRAM and reading it back, about 100 nanoseconds. With CXL, you have the addition of that CXL controller, so it does add some latency, as we discussed. Some of it comes from the CPU and the PCIe root port serializing the data; some comes from the CXL controller itself as it translates that data back into a parallel interface for the DRAM; and then you still have the DRAM latency. So, as I'm showing here, you get something like 170 to 210, maybe 250 nanoseconds of latency, which is more than direct-attached memory. However, keep in mind that with direct-attached memory, you're often doing things like talking to memory on an adjacent CPU via a NUMA hop, and a NUMA hop adds a very similar latency to what we're showing here for CXL. So the infrastructure is already used to latency variation of this order of magnitude. Similarly, adding a CXL switch into the equation adds further latency, but you're still in that hundreds-of-nanoseconds range, versus an SSD where you're most likely talking about tens of thousands of nanoseconds. We've done real-world measurements on one of our four-DIMM add-in cards, I believe on an AMD system. It varies depending on which system, BIOS, configuration, and processor, but this is pretty typical.
We see just over 200 nanoseconds of measured round-trip latency using MLC, Intel's Memory Latency Checker. That lines up right where we expected it to be for direct-attached CXL memory expansion, so it's a pretty solid result.
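You can sanity-check that ~200-nanosecond result by summing rough component latencies. The split between components below is our own illustrative assumption for the sketch, not a datasheet figure:

```python
# Illustrative round-trip latency budgets, in nanoseconds.
# The per-component split is an assumption for illustration only.

direct_ns = {
    "CPU memory controller": 40,
    "DRAM":                  60,
}

cxl_ns = {
    "CPU root port + SerDes":             50,
    "CXL controller (incl. DDR control)": 100,
    "DRAM":                               60,
}

print(f"direct-attached: ~{sum(direct_ns.values())} ns")   # ~100 ns
print(f"CXL-attached:    ~{sum(cxl_ns.values())} ns")      # ~210 ns
```

The CXL total lands inside the 170 to 250 nanosecond window quoted above, and matches the just-over-200 ns MLC measurement.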
Okay, let's talk about bandwidth. A typical memory controller talking to a single DDR5-6400 module will be in the neighborhood of 51.2 gigabytes per second of bandwidth. That's read bandwidth or write bandwidth; it's unidirectional, since you can't read and write at the same time on a DDR interface. With CXL, a card like this has an x16 interface, which has a theoretical maximum of 64 gigabytes per second. When we measure with different combinations of reads and writes, also using MLC, we see high 40s to low 50s of gigabytes per second, which is right in the same ballpark as a DDR5 interface. So we're happy with the results there.
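The peak numbers quoted here fall straight out of the interface math; here's that arithmetic as a quick sketch (PCIe Gen5 encoding overhead is ignored for round numbers):

```python
# Peak-bandwidth arithmetic behind the figures in the talk.

# DDR5-6400: 6400 MT/s x 8 bytes per transfer (64-bit data bus),
# unidirectional.
ddr5_gbps = 6400e6 * 8 / 1e9          # -> 51.2 GB/s

# PCIe Gen5 x16: 32 GT/s per lane, 1 bit per lane per transfer,
# so divide by 8 for bytes; ~1.5% 128b/130b encoding overhead ignored.
pcie5_x16_gbps = 32e9 / 8 * 16 / 1e9  # -> 64.0 GB/s each direction

print(f"DDR5-6400 channel: {ddr5_gbps:.1f} GB/s")
print(f"PCIe Gen5 x16:     {pcie5_x16_gbps:.0f} GB/s")
```

One subtlety: the PCIe link is full duplex, so unlike the DDR channel it can sustain reads and writes simultaneously, which helps mixed read/write workloads.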
Here's another real-world use case. This is from a white paper that Micron and AMD did, in a very summarized version. Essentially, they took a server with about a terabyte of memory and ran some tests, then added about another terabyte of CXL memory to make it a two-terabyte server, and ran three different applications: an in-memory database, machine learning, and high-performance computing. That should sound familiar. The in-memory database application is not memory-bandwidth limited; it's only consuming about 50% of the memory bandwidth, and it's not particularly sensitive to latency either. It really just cares about capacity. By adding that additional terabyte of capacity, they were able to reduce the amount of paging I/O to SSD by anywhere from 40% to 80%, as you can see there, which resulted in an overall performance improvement of 23%, which is gigantic. The machine learning case is right in the sweet spot of what we're talking about today for AI. It's also a memory-bound workload, meaning it benefits from more memory capacity, but it's very sensitive to latency; you don't want a huge amount of latency suddenly introduced by having to go out to SSDs or storage. In this case, by increasing the memory capacity with CXL while staying in the same ballpark of latency, they were able to more than double performance, which is pretty fantastic. Finally, for high-performance computing, it's a similar story, but this one is a bandwidth-intensive workload: it was not limited by the amount of memory; it's really focused on bandwidth.
So, by fine-tuning the data that was mapped to the direct-attached DRAM and having a little bit aimed at that CXL memory, they were able to optimize the performance of this specific application and take advantage of that extra bandwidth that CXL offers to get a 17% performance improvement when running that Cloverleaf example. So, across the board, some cool performance benefits you can get running CXL.
A couple more. We're working with a university in Taiwan that's doing a study on large language model training. They showed how a server with 512 gigabytes of direct-attached memory runs, that's the blue lines, and compared it to a server with the same total of 512 gigabytes, but with half of it direct-attached memory and half of it CXL add-in cards. You can see the results are virtually identical. So you might say, okay, CXL isn't really getting us anything. But what it's actually showing is that at scale, you'll be able to gain those other benefits: you'll be able to keep the cost of the memory down, and if you scale memory up, you'll be able to run models in a single node that you otherwise couldn't, because you'll have a higher memory capacity. They could also have kept the original 512 gigabytes and simply added an additional 256 gigabytes of CXL memory, giving them a larger memory pool to play with and the ability to run larger models. So, it's a step in the right direction.
This is the same university working on LLM training with CPU offloading, where you take some of the data being stored in the GPU and offload it to CPU main memory. They were able to show that adding CXL increases the total amount of system memory, which allows you to increase your batch size, which results in dramatically improved performance. They were also able to show that when you increase the number of GPUs in the system, you need to increase the amount of main memory to support those GPUs. That was the other part of their experiment.
Okay. The last one here is from a YouTuber; Level One Techs is the name of the channel. They did a demo benchmarking the new Intel Xeon 6 processor in a Supermicro server, and they had one of our add-in cards. What they were able to show is that it works in the system: the BIOS supported it, the OS supported it, and everything off the shelf came up relatively easily and was working well. He ran some MLC tests on the system too and showed that with this Intel Granite Rapids processor, it's getting around 320 nanoseconds of latency without any kind of tuning or adjustments, just off the shelf, and bandwidth right in that 50-gigabytes-per-second range, like we saw before. So this shows you can buy a system, buy CXL, plug it in, and you should be able to be off and running.
So, that's it. The conclusion is that CXL is here. It's here today. It helps us break through that memory wall. It really supports those memory-hungry applications like AI and machine learning and high-performance compute, etc. And performance is looking really strong. It's beating expectations, and we're really excited about what the future holds. Thanks very much for watching.