In this presentation, I will explain how to use our CXL-based processing-near-memory solution, called CXL PNM, more effectively. After that, Changho Choi will introduce our memory box solution that can be linked with CXL PNM.
As an introduction: processing in memory and processing near memory, called PIM and PNM, are novel technologies that integrate computation logic into memory semiconductors. By performing computation near memory, they are effective for memory-intensive operations. However, end-to-end AI and machine learning inference involves both compute-intensive and memory-intensive operations. So, in this session, this presentation addresses a method for effectively using CXL PNM in a heterogeneous system integrating a CPU, a GPU, and our CXL PNM device.
First, I would like to talk about PIM and PNM solutions. Both PIM and PNM can reduce data movement between the host processor and memory by performing computation near the memory. In our case, HBM-PIM, AxDIMM, SmartSSD, and our CXL PNM solution have been developed, putting data processing closer to memory. However, these solutions are only effective for memory-intensive operations.
ML applications involve both memory-intensive and compute-intensive operations. So, this presentation introduces an example ML application, a deep-learning-based recommendation model called DLRM, which predicts user interactions based on their experience. In DLRM inference, dense features, such as age, are processed in compute-intensive fully connected layers, while sparse features, such as type and gender, are processed through memory-intensive embedding lookup operations.
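To make the two paths concrete, here is a minimal, purely illustrative sketch of a DLRM-style inference step: the dense features flow through a small fully connected layer (compute-intensive), while the sparse features drive an embedding-table gather and sum (memory-intensive). All sizes, weights, and the interaction step are assumptions for illustration, not the actual DLRM model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
EMB_ROWS, EMB_DIM, DENSE_IN = 1000, 16, 4

table = rng.standard_normal((EMB_ROWS, EMB_DIM))  # one sparse embedding table
w = rng.standard_normal((DENSE_IN, EMB_DIM))      # bottom fully connected weight

def dlrm_like_inference(dense, sparse_ids):
    # Compute-intensive path: fully connected layer over dense features.
    dense_out = np.maximum(dense @ w, 0.0)        # ReLU(dense x W)
    # Memory-intensive path: embedding lookup (gather) plus sum pooling.
    emb_out = table[sparse_ids].sum(axis=0)
    # Feature interaction: here simply a dot product of the two vectors.
    return float(dense_out @ emb_out)

score = dlrm_like_inference(rng.standard_normal(DENSE_IN), [3, 42, 7])
print(type(score).__name__)
```

The gather `table[sparse_ids]` touches scattered rows of a large table, which is exactly the access pattern that benefits from computing near memory.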
So, before introducing our work, I would like to introduce our PNM solutions. We have developed many PNM solutions that operate over the DIMM interface and the CXL interface. This figure shows that our PNM devices, such as the AxDIMM and the CXL Memory Module with DLRM Compute, called CMMDC, place the accelerator logic between the CXL IP and the memory controllers.
This figure shows the typical architecture of the CXL PNM controller. It is designed not only to operate as a CXL Type-3 memory expander, but also to effectively manage its accelerator logic. The CXL.io path serves as a sideband interface for the host processor to configure, program, and control the accelerator logic. In the CXL PNM for DLRM acceleration, the host processor can configure and program the PNM engine using the CXL.io path, and read the results through the CXL.mem path.
So now I would like to talk about performing end-to-end DLRM inference using an XPU and our CMMDC device. When used with a host processor that supports the CXL protocol, the host processor and CXL PNM can improve performance by performing, in parallel, the computations each is good at. In this case, the CXL-supporting processor conducts the compute-intensive operations, such as the fully connected layers, while our CMMDC device conducts the memory-intensive embedding lookup operations. The embedding lookup operation on our CXL PNM device is performed sequentially through an instruction write, the computation itself, a result read, and register polling to check for completion.
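The four-step sequence just described (instruction write, computation, result read, and a completion check via register polling) can be sketched from the host's point of view. This is a hypothetical driver-style sketch against a fake in-memory device; the register offsets, status bit, and the doubling "computation" are all assumptions, not the real CXL PNM register map.

```python
# Hypothetical host-side sketch of the CXL PNM sequence:
# instruction write -> start -> poll completion -> read result.
# Offsets and the fake device behavior are illustrative assumptions.

REG_INSTR, REG_START, REG_STATUS, REG_RESULT = 0x00, 0x08, 0x10, 0x18
STATUS_DONE = 0x1

class FakePnmDevice:
    """Stands in for the memory-mapped CXL PNM register file."""
    def __init__(self):
        self.regs = {}
    def write(self, off, val):
        self.regs[off] = val
        if off == REG_START:  # the fake device "computes" instantly
            self.regs[REG_RESULT] = self.regs[REG_INSTR] * 2
            self.regs[REG_STATUS] = STATUS_DONE
    def read(self, off):
        return self.regs.get(off, 0)

def run_embedding_lookup(dev, instruction):
    dev.write(REG_INSTR, instruction)        # 1. instruction write
    dev.write(REG_START, 1)                  # 2. start the computation
    while dev.read(REG_STATUS) != STATUS_DONE:
        pass                                 # 3. register polling
    return dev.read(REG_RESULT)              # 4. read the result

print(run_embedding_lookup(FakePnmDevice(), 21))
```

Every iteration of the polling loop in step 3 is a round trip to the device, which is why the path that polling travels over matters so much in the next part of the talk.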
During the computation process in the CXL PNM device, the start and completion checks are carried out over the CXL.io path using a register polling mechanism. Due to the protocol overhead of the CXL.io path, this incurs relatively higher latency than the CXL.mem path. To improve the latency, we solved the problem by allowing the host processor to access the CXL PNM configuration registers through the CXL.mem path.
By moving the register polling from the CXL.io path to the CXL.mem path, the time consumed checking the start and finish of the embedding lookup operation has been dramatically reduced. The polling portion of the embedding lookup execution time dropped from about 80% to about 7% of the total, which leads to performance improvements in DLRM inference latency and throughput. In summary, the latency was improved by about 80% compared to using the CXL.io path, and the throughput was about 4.2 times higher.
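A tiny cost model shows why shrinking per-poll latency moves the needle this much. The numbers below are not measurements; they are assumed values chosen only to illustrate how a cheaper poll path collapses the polling share of each lookup.

```python
# Illustrative cost model (assumed numbers, not measured data):
# total lookup time = fixed compute time + (number of polls x per-poll latency).

def lookup_time_us(compute_us, polls, poll_us):
    return compute_us + polls * poll_us

io_path  = lookup_time_us(compute_us=10, polls=20, poll_us=2.0)   # polls over CXL.io
mem_path = lookup_time_us(compute_us=10, polls=20, poll_us=0.04)  # polls over CXL.mem

print(io_path, mem_path)
print(f"polling share: {40 / io_path:.0%} -> {0.8 / mem_path:.1%}")
```

With these assumed values, the polling share falls from 80% of the lookup to roughly 7%, the same shape of improvement the talk reports; the absolute latencies here are invented.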
Currently, the software stack of our CXL PNM is designed to be compatible with the PyTorch 1.x API, and we are planning to implement it with UXL, considering integration with our CXL PNM solutions.
So that was the CXL PNM part, and Changho Choi will now introduce our CXL memory module box.
Thank you, Sangsu. Okay, so how about having a bunch of CMMDC devices in one box? For that, we also developed a CXL memory module box. This box can host heterogeneous CMM memories, including CMMD, CMMH, and CMMDC. CMMD is composed of DRAM only; CMMH is composed of DRAM plus NAND flash; and CMMDC, as Sangsu mentioned, is CMMD plus computation. This memory appliance can be used as a disaggregated, composable memory box. It provides memory pooling and memory sharing features, and it supports the CXL 1.1 and CXL 2.0 protocols. We built it as a rack-mountable chassis, so you can install this memory box in a rack without any issues. In addition to the memory box itself, we also developed a fabric orchestrator called the Samsung Cognos Management Console, which I will briefly explain on the following slide.
In addition to the box itself, we collaborated with Supermicro to develop a rack-scale composable memory bank solution. With this rack-scale solution, we get a highly scalable and composable CXL memory pooling and sharing system. It can also host processing-near-memory devices, so you can get a bunch of CXL memory and processing-near-memory devices as well. This is a future-proof solution that scales memory capacity and bandwidth, and we provide it as a kind of turnkey solution. As you may already know, when you work with CXL there are a lot of complications, including server BIOS configuration, to make everything work correctly. With this rack-scale composable memory bank solution, you don't have to put in that effort, because the configuration is preset: you can just allocate memory to your servers and run. It is also a TCO enhancement solution. Many people in these sessions have already discussed memory stranding issues; with dynamic memory allocation and memory pooling, we can avoid memory stranding, allocate memory on demand based on capacity and bandwidth requirements, and offload processing to the CMMDC that Sangsu presented.
As I briefly mentioned, we also developed a fabric orchestrator. This memory fabric orchestrator is aligned with the OCP CMS/DC memory fabric orchestration sub-team architecture. It supports a REST API, shows performance metrics, and can display host memory utilization and statistics.
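To give a feel for what a REST-driven orchestrator interaction might look like, here is a hypothetical sketch of building a memory-allocation request. The endpoint path, field names, and units are invented for illustration; they are not the actual Samsung Cognos Management Console API.

```python
import json

def build_allocation_request(host_id, capacity_gib):
    """Build (method, path, body) for an assumed memory-allocation endpoint.

    Everything here is a placeholder: a real fabric orchestrator would
    define its own resource paths and request schema.
    """
    body = json.dumps({"hostId": host_id, "capacityGiB": capacity_gib})
    return "POST", f"/api/v1/hosts/{host_id}/memory-allocations", body

method, path, body = build_allocation_request("server-01", 256)
print(method, path)
```

The point is only the shape of the workflow: an operator (or automation) asks the orchestrator over REST to carve pooled capacity out for a given host, rather than reconfiguring each server by hand.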
Beyond the development itself, we ran a cutting-edge in-memory database that is widely used in the industry: SAP HANA. Here, we set up two systems. On the first system, on the left, we allocated one CMMD device; we set up the second server with higher capacity and more bandwidth. Capacity-wise, if you allocate more CMMD devices, you get capacity scalability, and in this case we got 7x the capacity. We also allocated 2x the bandwidth to the second server. In this environment, we saw a 32% end-to-end SAP HANA performance improvement when running TPC-DS. This is a real-time performance measurement on SAP HANA.
Lastly, I want to make a call to action! This CXL topic is mainly related to the OCP CMS project, so I invite you to participate in composable memory system development and enablement. As we discussed, there are many things we need to address, including programming model development and how we can make it more scalable and secure. In addition, OCP has a Data-Centric Computing work stream, which is actively discussing processing-near-memory capability and also computational storage. So, I invite you to participate in OCP DCC as well. Thank you for listening to my talk!