YouTube:https://www.youtube.com/watch?v=qlKXxUjXCBg
Text:
Welcome to this product demonstration of the MemVerge Memory Machine X version 1.4 release. My name is Steve Scargall and I'm the CXL product manager at MemVerge. MemVerge Memory Machine is a memory management solution for unmodified applications in heterogeneous memory environments, that is, systems that have DRAM and CXL or multiple types of memory. Capitalizing on the revolutionary CXL technology, MemVerge Memory Machine X delivers a really nice user-friendly interface, so administrators can see the system topology with live telemetry and users can unleash application performance using our memory quality of service features. Memory Machine has some really cool features, including visualizing the system topology: the CPU, DRAM and CXL. We provide live telemetry as well as historical telemetry, and we have memory quality of service features for latency- and bandwidth-sensitive applications. We also have memory insights, which gives you insight into application memory usage, including the hot working set size, capacity, etc. We'll show all of this and more throughout the demonstration. So with that, let's get going.

Installing Memory Machine is very straightforward. After you've downloaded the binary, you can just go ahead and install it using sudo. It'll ask you to read and agree to the end user agreement, so we can go ahead and do that. Once you've read it and understood it, press Q, and then press Y to confirm. The installation process will then go ahead. This takes a couple of minutes, so I'll fast forward and come back when it's done.

So this is the user interface. We've just navigated to the host name on port 8080 in our browser. Let me just walk you through it real quick. On the top bar here, we have the menu navigation system. So this is the dashboard, the landing page that you'll come to first and foremost. Then we have memory quality of service and insights, and I'll drill down into these momentarily. On the right-hand side, we have the ability to switch between light and dark mode in the user interface.

The version of our product is listed top right. Underneath that is a little information about our server. In this case, we see the host name, how long it's been up, the operating system, the CPUs, how much memory is installed, and the CXL devices listed.

We summarize the memory capacity for both DRAM and CXL. So here you can see the total system memory, and then broken down again is the amount of DRAM and CXL that we're using, in terms of capacity and percentage used.
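To make the capacity summary concrete, here is a minimal sketch of how a per-tier "capacity and percentage used" figure could be derived. It is not how Memory Machine computes its numbers. It assumes each memory tier is exposed as a Linux NUMA node whose counters are in the per-node meminfo format (as found under /sys/devices/system/node/nodeN/meminfo); the sample text and the helper names parse_node_meminfo and summarize are made up for illustration.

```python
import re

# Matches lines such as "Node 0 MemTotal:       134217728 kB".
_LINE = re.compile(r"Node\s+\d+\s+(\w+):\s+(\d+)\s+kB")


def parse_node_meminfo(text: str) -> dict[str, int]:
    """Return {field_name: kilobytes} for one NUMA node's meminfo text."""
    return {name: int(kb) for name, kb in _LINE.findall(text)}


def summarize(tiers: dict[str, dict[str, int]]) -> dict[str, dict[str, float]]:
    """Convert per-tier kB counters into GiB totals and percentage used."""
    summary = {}
    for tier, info in tiers.items():
        total_gib = info["MemTotal"] / 1024**2
        used_gib = info["MemUsed"] / 1024**2
        summary[tier] = {
            "total_gib": total_gib,
            "used_gib": used_gib,
            "percent_used": 100.0 * used_gib / total_gib,
        }
    return summary


# Hypothetical sample data: one DRAM node and one CXL node.
DRAM_TEXT = "Node 0 MemTotal:       134217728 kB\nNode 0 MemUsed:        33554432 kB\n"
CXL_TEXT = "Node 2 MemTotal:       67108864 kB\nNode 2 MemUsed:        16777216 kB\n"

report = summarize({
    "DRAM": parse_node_meminfo(DRAM_TEXT),
    "CXL": parse_node_meminfo(CXL_TEXT),
})
for tier, row in report.items():
    print(f"{tier}: {row['used_gib']:.0f}/{row['total_gib']:.0f} GiB "
          f"({row['percent_used']:.0f}% used)")
```

On a real system the two sample strings would be replaced by the contents of the corresponding node files.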

Underneath that, we have a system topology view. So this is a two socket Intel server.

All of these objects, you can mouse over them and get more information. You can see the CPUs here in the middle, the DRAM modules if they're populated or unpopulated.

And on the right hand side, again, we've got socket one with a CXL device attached.

Underneath the system topology are our system telemetry metrics. So we can see the CPU utilization, the memory utilization, and the memory throughput. And all of these are interactive.

So you can hover over these and see the date and the time, and information about the telemetry that we're returning to the user. You can maximize any of these if you want a zoomed-in version. You can select a specific date and time range if you just want to focus in on that, and you can go back to the main view. If you zoom in on any of these charts, for example, they stop updating because you've zoomed in on that specific date and time, so you can just refresh the chart to get back to the live information. All the charts by default show you the last five minutes, but this is user selectable. So you can go back and have a look at the last 15 minutes' worth of data. You can also look back over the past day and week, whatever is suitable for your requirements. So that's really it for the dashboard.

So we'll go ahead and look at some of the other features that we have in this product.

So this is the memory quality of service, or QoS, page. Again, this is accessible via the top menu up here. The quality of service feature offers two memory policies: latency and bandwidth. By default, QoS is turned off, so we can go ahead and click to turn on QoS. You'll get a pop-up at the top right here asking you to confirm that you do indeed want to enable QoS.

Once we enable this, you have the choice of either latency or bandwidth. The intent here is that in latency mode, we monitor the active memory pages for the application or applications that we want to inspect, and the hot memory pages are kept in the fastest tier. So in this case, they'll be kept in DRAM, and any cooler pages will be moved out to CXL memory devices. There is a trigger process size threshold, so your application does need to use more memory than the threshold. Here it's just eight gigabytes, but you can change this. Then we can go ahead and click save. It'll ask you if you really want to enable it. We just hit start, and then QoS is enabled.
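The latency policy just described can be sketched roughly as follows. This is only an illustration of the idea (rank pages by how hot they are, keep the hottest in DRAM, move the rest to CXL, and skip processes at or below the size threshold), not Memory Machine's actual algorithm. The function names, the access-count input, and the dram_page_budget parameter are all assumptions made for the example.

```python
def is_eligible(process_gib: float, threshold_gib: float = 8.0) -> bool:
    """A process is tiered only if it uses more memory than the threshold."""
    return process_gib > threshold_gib


def place_pages(access_counts: dict[int, int], dram_page_budget: int) -> dict[int, str]:
    """Assign each page to a tier: hottest pages to DRAM, cooler pages to CXL.

    access_counts maps a page number to how often it was touched recently.
    dram_page_budget is how many pages may stay in the fast tier.
    """
    # Hottest first; ties keep their original order because sorted() is stable.
    ranked = sorted(access_counts, key=lambda page: access_counts[page], reverse=True)
    return {
        page: ("DRAM" if rank < dram_page_budget else "CXL")
        for rank, page in enumerate(ranked)
    }


# Hypothetical example: four pages, room for two of them in DRAM.
counts = {0: 50, 1: 2, 2: 40, 3: 0}
if is_eligible(process_gib=12.0):
    placement = place_pages(counts, dram_page_budget=2)
    print(placement)
```

A real tiering engine would re-evaluate placement continuously as access patterns change; this sketch makes a single decision from one snapshot.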

So this takes a little time just to trigger and monitor the processes that are running on the system. Once it has activated, we'll start to see some data in the charts up above, and we'll start to see some processes listed in the process list down here. So I'll just leave this for a couple of minutes, let it trigger, and then we'll come back and take a look and see what happens when we've got real applications running on this system.

OK, so we let the auto-detect feature kick in. It only takes a minute or so to enable that. Now that we've left the UI running for a little bit, we can start to see data. The top chart here, NUMA node bandwidth, is showing the throughput of the NUMA nodes. So here, blue is NUMA node 2, which is our CXL device. You can see now that we have Weaviate listed in our processes. That's our database running in the back end here. You can see information about that process: how much CPU it's using, how much DRAM and CXL in terms of capacity, when it was started, how long it's been running, and who's running it. Telemetry for this specific process shows its CPU utilization and its memory usage as well, and again, we break this down for CXL and DRAM. So this is kind of interesting just to prove that our latency tiering QoS feature is working, and to see the throughput and telemetry of that particular process. If you have more than one process running, you'll see them listed on the left-hand side. You can also use the search feature here to find them if it is a particularly long list. And then again, the information for the selected process is shown on the right-hand side.

So now that I've shown you the latency policy working, let me go ahead and switch this over to the bandwidth policy. It's very similar. We can go to the top right here, click the system services icon, and click on bandwidth. Now bandwidth is a little different, because what we're trying to do here is aggregate the combined bandwidth of both the CXL and DRAM installed in the system. We still have a process capacity or size threshold here. Again, it defaults to 8 gigs, and you're welcome to change this. But now we also want some kind of ratio. The ratio is determined by the capacity and performance of the DRAM modules that you have in the system, and by the number of CXL devices you have and their performance characteristics. The default is 90/10, but you're welcome to change this. We have some discrete values here of 80/20, 75/25, 70/30, et cetera, all the way down to 50/50. And you can see we have blue representing the CXL memory and green representing DRAM. So here I can go ahead and just do a 90/10 and hit save. It'll ask me again if I'm sure I want to do this, and I am, so I hit start.
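One way to picture a ratio such as 90/10 is as a weighted interleave: out of every 100 pages a process allocates, roughly 90 land on one tier and 10 on the other, so both memory channels are kept busy. The sketch below illustrates that idea only. It assumes the first number in the ratio is the DRAM share, which the demonstration does not state explicitly, and the function name interleave is hypothetical; it is not Memory Machine's implementation.

```python
def interleave(n_pages: int, dram_pct: int) -> list[str]:
    """Spread n_pages across DRAM and CXL in a dram_pct/(100 - dram_pct) ratio.

    The CXL pages are distributed evenly through the sequence rather than
    being grouped at the end, so any run of pages touches both tiers.
    """
    if not 0 < dram_pct < 100:
        raise ValueError("dram_pct must be between 1 and 99")
    cxl_pct = 100 - dram_pct
    placement = []
    for i in range(n_pages):
        # Page i goes to CXL whenever the running CXL quota crosses an integer.
        goes_to_cxl = (i + 1) * cxl_pct // 100 > i * cxl_pct // 100
        placement.append("CXL" if goes_to_cxl else "DRAM")
    return placement


# Assumed reading of the default 90/10 ratio: 90% DRAM, 10% CXL.
pages = interleave(100, dram_pct=90)
print(pages.count("DRAM"), pages.count("CXL"))
```

Choosing the ratio is the hard part in practice: as the speaker notes, it depends on the capacity and performance of the DRAM and CXL devices actually installed.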

After a few seconds, it's going to switch over to bandwidth. You can see the policy has now changed: it's showing bandwidth rather than latency in the top right. Again, this takes another couple of minutes to switch over. You might be switching policies like I just did here, but normally you would want to start with the correct policy for your system. So anyway, this is how you would enable the bandwidth policy. If you want to stop QoS altogether, again, you just come to the system services icon here and hit the enable toggle. It's going to ask you if you want to stop the QoS manager, and if you do, you can hit stop. It'll then take a moment or two to stop the service. So again, this is how you start and stop the service, how you switch policies, and how you select between latency and bandwidth. I hope you found that useful.

And then next, we'll go and have a look at the memory insights feature.

So here we are on the memory insights page. The intent of memory insights is to give you insight into running applications: how much memory they're using, their hot working set size, et cetera. This can be very useful if you're in a pre-CXL environment and want to know how much CXL capacity you're going to need when you go and provision that hardware. I navigated here through the memory insights menu option at the top. We can see a long list of processes running on this particular system.

Because this is a long list, you can use the search feature to look for interesting processes. Here I'm interested in my Weaviate database. Now that I've selected Weaviate, you can see information here about its memory usage and CPU usage. This system does have CXL in it, and Weaviate is using CXL. If I want to look at the hot working set size, I can activate that. I can come up to time duration and tell our monitoring system how long I want to monitor my process for. The default is an hour, but you can select 5 minutes, 15 minutes, et cetera. I can then just select the toggle here to enable it, and it's going to ask me if I want to start the monitoring service, which I do. So I hit start, and after a few seconds, it's going to start the monitoring process. It will take a minute or two for data to show up in here, as we need to collect that initial telemetry before we can display the chart. So I'm going to pause here and come back when we actually have data to talk about.

Okay, so we've left this monitoring agent running for about 10 minutes here, and now we're starting to see some really good data. We can see at the top here how much memory is used, peak and average, and the hot working set size as well. The charts are updating to show the total memory usage and the hot working set size of around 20 gigabytes. Below that, we see the individual process information as well. So this is a good tool to give you insight into the running processes and identify where CXL can benefit you. If you want to disable monitoring, you can either let the time that you selected elapse, in this case an hour, or you can just click the on/off toggle. It'll ask you to confirm, which we do, and that stops the process. Since we manually stopped this monitoring agent, you can see the process here now has an orange dot next to it. That means it's a partial report, meaning that we didn't allow the report to run all the way to the end of the hour. There is a filter option here, so you can search for actively monitored processes, older reports that successfully completed their selected duration, and partial reports, of which this is a good example. Partial reports are the ones where either somebody stopped monitoring before the selected duration elapsed or the process exited before it reached the end of the time interval.
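As a rough illustration of what a "hot working set size" means, the sketch below counts the pages a process touched repeatedly during a monitoring window and multiplies by the page size. The demonstration does not say how Memory Machine measures this, so treat the approach, the function name hot_working_set_bytes, the min_hits cutoff, and the 4 KiB page size as assumptions made for the example.

```python
from collections import Counter
from collections.abc import Iterable


def hot_working_set_bytes(
    samples: Iterable[Iterable[int]],
    min_hits: int,
    page_size: int = 4096,
) -> int:
    """Estimate the hot working set from periodic page-access samples.

    samples holds one collection of page numbers per sampling interval,
    listing the pages seen as accessed during that interval. A page counts
    as hot if it was accessed in at least min_hits intervals.
    """
    hits = Counter()
    for touched in samples:
        # set() so a page counts at most once per interval.
        hits.update(set(touched))
    hot_pages = sum(1 for count in hits.values() if count >= min_hits)
    return hot_pages * page_size


# Hypothetical window of three sampling intervals.
window = [{1, 2, 3}, {1, 2}, {1, 4}]
print(hot_working_set_bytes(window, min_hits=2))
```

In this toy window, pages 1 and 2 are touched in at least two intervals, so the estimate is two pages. Everything outside the hot set is a candidate for a slower tier, which is why this number helps when sizing CXL capacity.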

So thank you for watching this product demonstration of Memory Machine X version 1.4. I hope you found it useful. We encourage you to visit memverge.com/cxl for more information about our CXL products.