YouTube: https://www.youtube.com/watch?v=4iSwpXIbwTk
Text:
Okay, great. I just heard Michael's presentation, which was great. In some ways these are companion slides. As Frank mentioned earlier, this is about how CXL memory expansion can help from an end-user viewpoint. First, I'll talk about the motivation for memory expansion: how we can scale up a server, and what the reasons are for doing that. We've been seeing this a lot in traditional HPC, high-performance computing, and database use cases, and, as previous speakers mentioned, in the emerging AI use case, which actually shares a lot of characteristics with HPC. There are three drivers. The first is ever-larger datasets, and I think Michael's memory wall would be a good example of this. When we have to process an ever-larger dataset, sometimes it's not possible to shard it and process it in a distributed fashion. In that case, you really need a scaled-up single server to handle it. We have seen this most recently in genomics, but it's true for other simulations as well. The second is faster time to result, comparing a scale-up solution versus a scale-out solution. Scale-out is the most typical way people solve these large problems, but a distributed system also introduces overhead in terms of cross-node communication and lower CPU utilization. By consolidating all this work into a larger node, where possible, we get some benefit. Here's an example from a genomics workload, MetaBat, which processes large gene-sequencing data. Previously, our friends at Lawrence Berkeley Lab were using a cluster of 100 nodes, each with 64 gigabytes. An analytic run was taking more than two weeks, and because it's such a large cluster, there was a high chance of a failure in between.
When they consolidated the workload onto a single four-terabyte node, they were able to reduce the runtime to four days and get a fairly consistent finish rate without failures. That's time to result. The third driver is TCO, total cost of ownership, or more specifically a higher performance-to-cost ratio. I'm using SQL Server as an example. This is an OLAP, analytics-style database, and we ran some tests. Take a single baseline 64-gigabyte node and assume it delivers a baseline queries-per-second. If you have to support more queries per second, one solution is scaling out with additional servers. The cost of this is the additional server itself, but also the software license cost. Enterprise software is usually licensed on a per-CPU basis; SQL Server, for instance, runs about $7,000 per year per CPU core. When you scale out, you increase the CPU core count, and the software license cost is often just as much as, if not more than, the server cost itself. With CXL memory expansion, we have another option: scale the individual server. We can support the same 2x queries per second just by expanding the memory on the existing server. That way, we cut out the additional server cost as well as the additional per-CPU software cost. Those are the motivations, from an end-user standpoint, for using CXL memory expansion to scale up a server.
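To make the scale-out versus scale-up comparison concrete, here is a minimal cost sketch. The $7,000-per-core-per-year license figure comes from the talk; the server price, core count, and CXL expansion cost below are placeholder assumptions for illustration, not quoted numbers.

```python
# Illustrative TCO sketch for scale-out vs. scale-up.
LICENSE_PER_CORE = 7_000    # per-core SQL Server license per year (from the talk)
CORES_PER_SERVER = 32       # assumed core count per node
SERVER_COST = 15_000        # assumed hardware cost of one additional server
CXL_EXPANSION_COST = 6_000  # assumed cost of CXL memory added to existing server

# Scale-out: a second server adds hardware cost plus licenses for its cores.
scale_out = SERVER_COST + CORES_PER_SERVER * LICENSE_PER_CORE

# Scale-up: expand memory on the existing server; licensed core count unchanged.
scale_up = CXL_EXPANSION_COST

print(f"scale-out first-year cost: ${scale_out:,}")
print(f"scale-up  first-year cost: ${scale_up:,}")
```

Even with generous assumptions for the scale-up memory cost, the per-core licensing dominates the scale-out total, which is the point the talk is making.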
Now, maybe I jumped the gun a little bit, but scaling up a server, particularly on the memory side, has its challenges. We can group them into a couple of categories. One is that there are a limited number of DIMM slots on the motherboard; there are only so many memory modules you can use. As you saw in Michael's presentation earlier, there are limits on the capacity and bandwidth available from those DIMM slots. And because of the limited slots, if you want to go up in memory, you essentially have to use high-density memory modules. For these high-density modules, particularly at 128-gigabyte capacity, the cost goes up significantly: the per-gigabyte cost almost triples because of manufacturing technology limitations. So a big component of this challenge is cost-related. What I'd like to go through in this presentation is how CXL memory expansion, along with our Memory Machine software, can address these issues, so that you can have the capacity and bandwidth you need, at the right cost, without impacting application performance. Basically, have the cake and eat it too.
This is maybe a refresher on the CXL memory expansion alternatives. I think Ninesh had gone through the E3.S modules. They're easy to load, and you can add or remove them just like SSDs, right in the front of the server. However, they come in fixed capacities; we have seen 128, 256, or even up to 512-gigabyte models announced. They typically use x8 PCIe lanes, which sets the bandwidth of these E3.S modules. Next up is the PCIe add-in card, which connects to a PCIe slot. These cards come with their own DIMM slots, so you have the flexibility to populate them at different densities; they can go up to two terabytes per card on the eight-DIMM versions. In terms of memory bandwidth, they can go up to 16 PCIe lanes; for reference, that's about one DDR5 channel of bandwidth. As a previous presentation covered, there are other alternatives: from a hyperscaler standpoint, you have dedicated connector and board configurations. But from an end-user standpoint, these are the more common configurations.
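The "x16 is about one DDR5 channel" comparison can be checked with back-of-envelope arithmetic. PCIe 5.0 runs 32 GT/s per lane with 128b/130b encoding, and a DDR5-4800 channel moves 4800 MT/s over an 8-byte bus. Note the raw PCIe math below overshoots what a CXL device delivers in practice, since protocol overheads reduce effective bandwidth; these are round figures, not measurements.

```python
# Back-of-envelope bandwidth arithmetic for the x8 / x16 comparison above.
GT_PER_LANE = 32                               # PCIe 5.0 raw rate, GT/s
GBPS_PER_LANE = GT_PER_LANE * (128 / 130) / 8  # ~3.94 GB/s/lane after encoding

x8 = 8 * GBPS_PER_LANE    # E3.S module link
x16 = 16 * GBPS_PER_LANE  # add-in card link

ddr5_channel = 4800e6 * 8 / 1e9  # DDR5-4800, one 64-bit channel, in GB/s

print(f"x8  PCIe 5.0 : {x8:.1f} GB/s")
print(f"x16 PCIe 5.0 : {x16:.1f} GB/s")
print(f"DDR5-4800 ch : {ddr5_channel:.1f} GB/s")
```

Raw x16 comes out around 63 GB/s per direction versus 38.4 GB/s for the DDR5 channel, so after CXL protocol efficiency is accounted for, "roughly one DDR5 channel" is a fair characterization.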
The next slide is actually the money slide. As Frank mentioned, we have been working with some customers on how to scale up their compute nodes within their budget. This table compares different configurations; I'll go through the columns first. The first three columns are the total memory capacity of the system, the total memory cost, and the per-gigabyte cost. Next are the memory slots, the DIMM modules populated on the motherboard; you can call them socket-mounted DRAMs to be more exact. Since we're looking at rather large memory systems, I'm focusing on the add-in cards: these columns give the type of add-in card, the number of add-in cards, and the type of DIMMs we populate them with. Let's look at one example, a four-terabyte server. If you build it with Intel dual-socket CPUs, you have 32 DIMM slots available; using 128-gigabyte DIMMs, we can build a four-terabyte system. This row shows the cost and the per-gigabyte cost associated with those 128-gigabyte DIMMs. With CXL expansion, we have an opportunity to reduce the cost. Essentially, we shift to lower-cost, lower-density 64-gigabyte DIMM modules, all DDR5 on the motherboard, and make up the shortfall using add-in card memory expansion: four add-in cards, each with eight DIMM slots populated with 64-gigabyte DDR5. With this configuration, you end up with the same total capacity at less than half the memory cost. You can optimize further by using DDR4; we have some partners building DDR4-enabled add-in cards. With DDR4 128-gigabyte DIMMs, we can cut down to two AIC cards, and the total system cost can be reduced even further.
Another example: say you want to build a higher-capacity server, an eight-terabyte server. The common way to do it is to use a quad-socket system instead of a dual-socket one, so you have more DIMM slots available. The memory cost alone is about $90,000. With CXL expansion, we can shift back to a dual-socket server using more AIC cards and again reduce the total cost to about half. Likewise, you can play with these numbers to build different configurations: an 11-terabyte system at a pretty reasonable memory cost, or, if you want to go further, a 32-terabyte system, though at significantly more cost. Across these configurations and their memory costs, as you can see, CXL enables significant, probably 50 percent, savings on the cost side.
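The structure of the cost table can be sketched as a small calculation. The DIMM prices below are placeholder assumptions chosen so the relative picture matches the talk (the 128-gigabyte DDR5 module is roughly triple the per-gigabyte cost of the 64-gigabyte one); none of these are quoted prices.

```python
# Sketch of the "same capacity, lower cost" comparison from the table above.
PRICE_PER_GB = {"ddr5_64g": 4.0, "ddr5_128g": 12.0}  # assumed $/GB

def config_cost(modules):
    """modules: list of (price key, module count, GB per module)."""
    gb = sum(n * size for _, n, size in modules)
    cost = sum(n * size * PRICE_PER_GB[key] for key, n, size in modules)
    return gb, cost

# 4 TB all on-socket: 32 slots x 128 GB DDR5.
base_gb, base_cost = config_cost([("ddr5_128g", 32, 128)])

# 4 TB with CXL: 32 x 64 GB on socket + 4 AICs x 8 DIMMs x 64 GB DDR5.
cxl_gb, cxl_cost = config_cost([("ddr5_64g", 32, 64), ("ddr5_64g", 4 * 8, 64)])

print(f"baseline: {base_gb} GB, ${base_cost:,.0f}")
print(f"cxl:      {cxl_gb} GB, ${cxl_cost:,.0f}")
```

Both configurations land at the same 4,096-gigabyte capacity, and under the assumed tripled per-gigabyte premium for high-density DIMMs, the CXL configuration comes in well under half the baseline memory cost, consistent with the table's conclusion.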
Here are some servers that can work with CXL expansion cards even today. This one is from Supermicro, but we have also seen this from Lenovo; in fact, I think Astera Labs showcased a system with a Lenovo server at DesignCon. You can have various 2U and 4U dual-socket and quad-socket systems: this particular 2U dual-socket AMD server can take up to four CXL add-in cards, and there's a 4U system that can take up to eight expansion cards. These are the configurations.
Now, we're talking about the cake, but there are always trade-offs. With the additional capacity and bandwidth from these add-in cards, you get a more complex memory hierarchy. This is the architecture slide for a dual-socket system. You have CPU0 and CPU1, each with its own local DDR, connected through UPI in the case of Intel or Infinity Fabric in the case of AMD. In addition, there are extra NUMA nodes: each add-in card is its own NUMA node. They're referred to as headless NUMA nodes, NUMA nodes that have memory but no compute of their own. That's the architecture.
If we drill down into a little more detail, you can see that this configuration has a total of six NUMA nodes. Assume your application or process is running on CPU0. Each NUMA node has different characteristics in terms of capacity, latency, and bandwidth. If your process accesses local memory, it gets the lowest latency and the highest bandwidth, assuming the local channels are fully populated. If you access memory attached to the other processor, you're one NUMA hop away, so there's additional latency, and the bandwidth is shared across that link.
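The six-node topology can be modeled as a small table, as seen from a process pinned to CPU0 (NUMA0). The latency and bandwidth numbers here are hypothetical round figures for illustration, not measurements from the talk.

```python
# Toy model of the six-NUMA-node dual-socket + 4-AIC topology described above.
# node: (description, load-to-use latency in ns, bandwidth in GB/s from CPU0)
NODES = {
    0: ("CPU0 local DDR",   100, 300),
    1: ("CPU1 remote DDR",  180, 100),  # one hop across UPI
    2: ("CXL AIC on CPU0",  250,  60),
    3: ("CXL AIC on CPU0",  250,  60),
    4: ("CXL AIC on CPU1",  330,  60),  # two hops: UPI then CXL
    5: ("CXL AIC on CPU1",  330,  60),
}

# Rank nodes by latency from CPU0 - the same intuition a placement policy
# uses when deciding where hot data should live.
by_latency = sorted(NODES, key=lambda n: NODES[n][1])
print("preferred placement order:", by_latency)
```

The ordering captures the talk's point: local DDR first, then the remote socket, then local CXL, and last the CXL cards hanging off the other socket, which are two hops away.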
If your application is running on CPU0 and accesses NUMA1, NUMA4, or NUMA5, the bandwidth bottleneck will actually be the UPI link between the two CPUs. Likewise, if you access NUMA4 from CPU0, you're two NUMA hops away, so you will see increased latency.
Application performance can be impacted by this. One impact is latency: if a frequently accessed memory page is located in, say, NUMA5 while your application is running on NUMA0, you have that additional latency to contend with, and it hurts application performance. The other impact is bandwidth. With frequent cross-NUMA memory accesses, it's possible to saturate a particular link, and when you saturate that bandwidth, the latency goes up significantly. It's much bigger than 100 or 200 nanoseconds; it can go up to 1,000 nanoseconds or so. This is actually what we measured: as long as your throughput is within the bandwidth limit, latency stays very close to the theoretical figure. However, when you saturate the bandwidth, latency can go up dramatically, even to the level where it causes a machine check. At that point it's no longer just a performance issue; it's a correctness issue, and your job may fail.
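The saturation behavior described above follows the familiar queueing-theory shape: latency stays near the unloaded figure while offered throughput is well under the link limit, then blows up as utilization approaches 1. This is an illustrative M/M/1-style curve with assumed numbers, not measured CXL data.

```python
# Minimal queueing-style sketch of loaded latency vs. offered throughput.
BASE_LATENCY_NS = 200  # assumed unloaded cross-NUMA latency
LINK_BW_GBPS = 60      # assumed bandwidth limit of the congested link

def loaded_latency_ns(offered_gbps):
    """M/M/1-shaped latency: base / (1 - utilization), clamped near saturation."""
    util = min(offered_gbps / LINK_BW_GBPS, 0.999)
    return BASE_LATENCY_NS / (1 - util)

for offered in (10, 30, 50, 58):
    print(f"{offered:3d} GB/s -> {loaded_latency_ns(offered):7.0f} ns")
```

Under these assumptions, latency roughly doubles at 50 percent utilization but passes the 1,000-nanosecond mark well before the link is fully saturated, matching the talk's observation that the penalty is far larger than the extra 100-200 ns of an idle cross-NUMA hop.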
That's where we come in from the Memory Machine perspective. Our product is called Memory Machine X. What we do is mitigate these impacts, both latency and bandwidth. Specifically, we continuously monitor the application's memory access pattern and optimize its data placement across the NUMA nodes to give you the highest performance. Here's an example with MySQL and the TPC-C benchmark. I have transactions per second (TPS) on the left and P95 latency on the right. These are the system configurations; it's the same CPU, I believe an Intel CPU. Originally we have 64 gigabytes of memory, all on the socket; that gives the baseline TPS. We then measured the same benchmark with an additional 64 gigabytes of CXL memory. You already see a boost here. However, because of the latency-related issues I described, we haven't reached the full potential. When we add in the Memory Machine software for the tiering, we see another boost in transactions per second as well as reduced latency. Just for reference, we ran the same benchmark with a fully populated 128 gigabytes on the socket. You do see higher transactions per second, but if you look at the cost-performance ratio, this is actually the sweet spot: with CXL memory expansion and Memory Machine software, we can give you the highest performance per cost, or, vice versa, the lowest cost-performance ratio.
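The general idea behind access-pattern-driven placement can be sketched in a few lines: count accesses per page over a sampling window, keep the hottest pages in the fast socket-attached tier, and demote the rest to the CXL tier. This is a generic illustration of hot/cold tiering, not Memory Machine X's actual algorithm; the function and page numbers are hypothetical.

```python
from collections import Counter

def plan_placement(access_log, fast_tier_pages):
    """access_log: iterable of page ids observed in one sampling window.
    Returns (pages for the fast tier, pages for the CXL tier)."""
    heat = Counter(access_log)
    ranked = [page for page, _ in heat.most_common()]
    hot = set(ranked[:fast_tier_pages])
    cold = set(ranked[fast_tier_pages:])
    return hot, cold

# Page 4 is hottest, then 1, 2, 3; the fast tier holds only two pages.
log = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4]
hot, cold = plan_placement(log, fast_tier_pages=2)
print("fast tier:", sorted(hot))
print("cxl tier :", sorted(cold))
```

A real implementation would resample continuously and migrate pages as the working set shifts, which is how a tiering layer keeps the latency-sensitive pages out of the slower CXL nodes.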
The next slide, and I think Michael said it really well, is that it takes a village. We've been working with our partners to build out these systems so we can get them into the hands of users. This is a selected sample of the add-in card partners we are working with. First up is the Astera Labs A1000. If you haven't seen the picture, this is the A1000; I actually saw the real device this morning and it looks great. Smart Modular has another system that takes four DIMM slots and is single-width, so it can fit into more servers. They also have an eight-DIMM system, and we're working with another partner that's sampling their system as well. Here's a picture of it: the Smart Modular system with four DIMMs mounted on a single-width card. In general, we're seeing more awareness and demand for this, as well as the ecosystem coming together to provide solutions that meet those demands.
This is the call to action. If you have any questions or feedback, I'd love to hear from you; please reach out to me. For that earlier spreadsheet, I also have analysis for AMD-based systems as well as various other configurations, so if you're interested, please reach out. Here's also a link to the wonderful Memory Fabric Forum and partner content; feel free to connect with those. That's pretty much my slides. Any questions?