Hi, my name is Shyam Iyer. I chair the Smart Data Accelerator Interface (SDXI) Working Group within SNIA, and I'm also a distinguished engineer at Dell. I'm going to be talking about the use cases, the proof points that we have, and what we're going to see in 1.1, because we already have a 1.0 version of the spec released, and go from there.

So the agenda is as follows: I'll give a little bit of an intro. I'll give you a preview of what's going to come in 1.1, what we're doing with software enablement, the different proof points, and then summarize it.

Okay, so I know some of you attended this morning's discussion, but let me quickly roll through some of the accelerator usage models that the standard tries to address. These are all representative, and there may be more use cases that you can do with this. So, for example, if you had a simple use case of software-based memcpy offload, and you needed an accelerator that can just help offload those buffer copies for you, this is the standard for you. It allows you to offload the copies that we end up doing unnecessarily within the software stack, sometimes in the name of flexibility and sometimes in the name of providing security between different context isolation layers. The interface provides you both. You can have the same stability as an instruction set, and you can also have the security isolation that software contexts provide. So you can distribute or move the data between different address spaces.

Another use case, which is something that everyone likes to hear about, is: "Can I move this data from my DRAM tier to, say, a persistent memory tier, potentially on the CXL bus?" Yes, you can use the standard and implementation of the standard for doing precisely that.

What if your application was virtualized, running within a guest, and you had a user space-based application running there? So, if you have dealt with virtualization before, your address is not your address. What you see is your address, right? So, it’s a guest virtual address that you see at the user space level. To be able to make the DMA happen, make memory data movement happen, it needs to be translated to a guest physical address to provide it to a DMA device, and then to an actual host physical address from a translation point of view. So, these levels of translations are there for security, but SDXI helps you navigate this in an architected way using this type of memory interface.
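The two translation steps described above can be sketched as follows. This is a toy model only: the single-level page maps, the page numbers, and the function names are all assumptions made for illustration, not anything defined by SDXI or by a real IOMMU.

```python
PAGE_SIZE = 4096

def translate(page_map: dict[int, int], addr: int) -> int:
    """Translate an address through a toy single-level map (page number -> page number)."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in page_map:
        raise LookupError(f"no mapping for page {page:#x}")
    return page_map[page] * PAGE_SIZE + offset

# Hypothetical mappings, chosen only for the example.
guest_page_map = {0x10: 0x200}    # guest virtual page  -> guest physical page
host_page_map = {0x200: 0x7F00}   # guest physical page -> host physical page

def gva_to_hpa(gva: int) -> int:
    """Guest virtual -> guest physical -> host physical."""
    gpa = translate(guest_page_map, gva)
    return translate(host_page_map, gpa)
```

The point of the sketch is only that a user-space buffer address inside a guest is two lookups away from the address the DMA engine finally uses, and that an unmapped page at either level stops the transfer.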

If you were to move the data between these memory tiers, whether it's in CXL, DRAM, you know, MMIO space, or in HBM memory, you can use this efficient data mover to be able to do that.

If you were to transform your data as you're moving it, let's say if you're compressing it, encrypting it, performing some kind of, I don't know, row-column transformation of that memory data, you could do that as part of the standard. Let's say if you wanted to momentarily compute on the data as you move it from an initial source location to a final destination location, the standard will also allow you to do that, because now you can create work items specifically to be able to compute on the job.

So this is a standard that has been architected to be extensible, forward-compatible, and independent of I/O interconnect technology. You can do address space to address space data movement, and you also have architected states for how the data movement is done and where it stops, quiesces, or resumes. That's very important because, as in the earlier discussion about migration, page migration, and things of that nature that we were just having on the side, you need architected states when you move your data so that you know when the actual data movement has happened and when it can be picked up by maybe another accelerator or by software itself. If you don't know what the intermediate state is, you don't know how to recover from it. And we are working on 1.1 as well.

So to give a quick, simplified view of the structures and how this actually works, an implementation of SDXI starts with what we call a function. A function can be a hardware function; it could be a software-emulated function as well, but if you really want the performance improvements, you would want to implement it in hardware. And this doesn’t have to be a PCIe function; it could be a CXL function. It could also be an ACPI-type device model. But the key thing is that it has some kind of MMIO space into which you program the location of the structures it will point to, so that you can control it and get your status updates and things of that nature.

So this MMIO space will be programmed with the location of the context tables. These are the contexts that need to be controlled, and they have specific control and state information. A context typically has at least one descriptor ring into which you enqueue your work, and a read and a write pointer. Think of them like a head and a tail pointer. Then you have a doorbell location through which you tell the function, "I want you to pick up this work that I just enqueued into the descriptor ring." The buffers are described in the descriptor itself. And the descriptor itself points to a location where the function may signal a completion when the operation has finished.

There’s a concept called an A-key table. This is a table that tells you which address spaces you can move this data into. For example, your local space is really just the zero address space, but if you were moving the data into a remote address space, you need to be allowed to do that, and that’s what the A-key table provides. In the same way, if a remote function is trying to access the data in your space, there is a table, the R-key table, that governs and controls it, meaning it allows you to say whether the remote function can access your memory space at all.

Now, this shaded region that I’m calling out is what user space or non-privileged users can actually control to perform any kind of data movement or transformation. The outer shaded region is where privileged software has a lot more control. This is important, for example, if a hypervisor wants to give a user application some way to access the accelerator directly: it still needs to control all of the outer region to ensure that it can intercept, emulate, or do something else to help with live migration and the like. And there is one error log for all of these contexts, and the number of contexts can scale to a large number.

There is one standard, architected way to do function setup and control. If you are a PCIe-based implementation, there is a class code registered with PCI-SIG, so, kind of like NVMe class codes, you can leverage a similar model here, and the SDXI kernel-mode driver that is being worked on is going to be an SDXI class code driver.
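The enqueue, doorbell, and completion flow described above can be sketched with a small software-emulated function. This is a minimal sketch under stated assumptions: the class names, field names, and the dictionary used as a completion record are all hypothetical, and the real descriptor and context layouts are defined by the SDXI specification, not by this code.

```python
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    """Hypothetical copy descriptor: buffers plus a completion record."""
    src: bytearray
    dst: bytearray
    length: int
    completion: dict = field(default_factory=lambda: {"done": False})

class Context:
    """Toy context: one descriptor ring, a write index, a read index, a doorbell."""
    def __init__(self, ring_size: int = 8):
        self.ring = [None] * ring_size
        self.write_index = 0   # advanced by the producer (like a tail pointer)
        self.read_index = 0    # advanced by the function (like a head pointer)
        self.doorbell = 0      # last write index the producer has announced

    def enqueue(self, desc: Descriptor) -> None:
        if self.write_index - self.read_index == len(self.ring):
            raise BufferError("descriptor ring is full")
        self.ring[self.write_index % len(self.ring)] = desc
        self.write_index += 1

    def ring_doorbell(self) -> None:
        # Tell the function that everything up to write_index is ready.
        self.doorbell = self.write_index

class EmulatedFunction:
    """Software-emulated function that services announced descriptors in order."""
    def process(self, ctx: Context) -> None:
        while ctx.read_index < ctx.doorbell:
            desc = ctx.ring[ctx.read_index % len(ctx.ring)]
            desc.dst[:desc.length] = desc.src[:desc.length]
            desc.completion["done"] = True   # signal completion to the producer
            ctx.read_index += 1
```

In this toy model, work that has been enqueued but not yet announced through the doorbell is not picked up, which mirrors the separation between writing a descriptor and telling the function about it.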

There is a common, standard descriptor format. These are just representative operation groups, whether it's the DMA-based group or the atomic group; the administrative operations are part of a separate group. As you can see, there is room for a lot of expansion for future operations. And the descriptor itself, as I pointed out before, points to a completion pointer, a 32-byte aligned memory location which contains the completion status. So, this is a pretty standard way of describing how to enqueue the work.

This is a more complicated example of address space to address space data movement. If you see here, the producer is in the middle. Let's say a hypervisor is trying to orchestrate a data movement from a VM in address space A to address space C; it can now enqueue the work. In the descriptor, you will have the A-key indexes. Once you have those, you know which address spaces are involved in the data movement. Inside the A-key entry, you will have an index into the R-key table in the remote address space. The entry there is what allows this particular requesting function to access that address space. If this checks out, you can access the source buffer in address space A. You do the same thing for address space C, because that's where the destination buffer is. Once the address space permission checks have passed, you can effect the data movement by doing a DMA read on one side and then a DMA write on the other side. This is to allow for security in address space to address space data movement. As you can imagine, you can do this with just a single address space, or with just two address spaces with bidirectional data movement going back and forth, or you can have n address spaces between which you are doing these kinds of data movements. So remember this picture, because I'm going to talk about how 1.1 takes this even further.
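The permission check walked through above (descriptor carries an A-key index, the A-key entry points at an R-key entry in the remote address space, and that entry names who may come in) can be sketched like this. The entry fields and the use of plain strings for address spaces and requesters are assumptions for illustration; the real table formats are in the spec.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AKeyEntry:
    """Hypothetical A-key entry: which space is targeted, and which R-key slot to check."""
    target_space: str
    rkey_index: int

@dataclass(frozen=True)
class RKeyEntry:
    """Hypothetical R-key entry: the one requester this slot admits."""
    allowed_requester: str

def check_access(requester: str,
                 akey_table: list[AKeyEntry],
                 akey_index: int,
                 rkey_tables: dict[str, list[RKeyEntry]]) -> bool:
    """Return True only if the target space's R-key entry admits this requester."""
    akey = akey_table[akey_index]
    rkey_table = rkey_tables.get(akey.target_space, [])
    if akey.rkey_index >= len(rkey_table):
        return False
    return rkey_table[akey.rkey_index].allowed_requester == requester

def copy_between_spaces(requester, akey_table, src_akey, dst_akey, rkey_tables,
                        src: bytes, dst: bytearray) -> bool:
    """Copy only if both the source and the destination checks pass."""
    if not (check_access(requester, akey_table, src_akey, rkey_tables)
            and check_access(requester, akey_table, dst_akey, rkey_tables)):
        return False
    dst[:len(src)] = src   # stands in for the DMA read plus DMA write
    return True
```

Here a producer in the middle (say, a hypervisor context) would hold two A-key entries, one for the source space and one for the destination space, and the copy proceeds only when both remote spaces have granted it access.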

If you're really interested in getting a lot more detail on this, I did give a 40-minute talk on it last year. There's a link to that talk on 1.0 internals. But there is also the whole spec that you can start looking at.

So let me take a preview of 1.1. Just so you know, 1.1 is not yet released. We are working very hard on it. But here is a preview, and hopefully, you guys can provide any feedback to us based on this.

So we started looking at the different ways that we could expand in 1.1, and we thought we could do a connection manager, new operations, host-to-host investigations, concurrency models, quality of service, et cetera. We've been working on a lot of these, doing the thought exercise on what exactly 1.1 needs to look like. And now we have a sort of a model.

So we have a model where we have 1.1, 1.2, and then a 2.0 being planned. 1.1 is going to be, at least to begin with, mostly the errata fixes that we have from 1.0. We also have additional use cases that have been prioritized by member participation, so additional transformations, things that we want to do in the 1.1 version. And it retains compatibility with 1.0. Because we want to get 1.1 out of the door quickly, since it has the errata fixes, we're going to overflow some of the use cases that we don't have completely fleshed out into 1.2. And 1.2 will also retain compatibility with 1.0 and 1.1. With 2.0, we are weighing the appetite for a lot more intrusive features, because there are lots of folks who have come in and told us that it would be really nice if we had this or that feature in there. And we want to add that. It might change the work queue model a little bit, but if we could provide that as a new mode, then it might be better. And if it is something that the industry coalesces around and says, no, this is really good, this is superior, and we want this going forward, that might be something that goes into 2.0. That's one of the beauties of being an open standard: you are getting the collective brain power from different companies, and they're all coming up with new ideas to make this more performant. Because at the end of the day, it's an accelerator. If it doesn't perform as well as you want it to, then it is not doing its job. That's the idea of this open standard, and what we are looking for is a common pattern that is applicable to a wide variety of implementations.

I don't want to put words in anyone's mouth, but an example would be Scalable IOV release 2: when it gets out in public, if there are things they want to work on with us to make part of the device model in SDXI, we would look at that potentially for a 2.0 release.

So what are we trying to do here? One thing we are doing is creating a new kind of operation group called the definable operations group, which means you can bring in your own innovation on how you want your operation to be while adhering to the format and the spec that have been created. Then there are new operations that some of the members have said they want to implement in 1.1. We've also made some improvements on memory ordering. It's a memory mover, so it has specific barriers around memory and how things need to be globally visible, et cetera, and that's where we have made some improvements. Then we have an improved point of view on what we think a connection manager looks like, what we think use cases involving memory fabrics, particularly CXL, look like, host-to-host use cases, quality-of-service use cases, storage use cases involving NVMe and computational storage, which you saw earlier today, and how we're looking at AI as well, because, you know, if you're not doing AI in this year, 2024, you're not doing it right, right?

So let's see what 1.1 is doing. In the 1.0 model, the vendor-defined operation group definition was very rigid, meaning if you were a vendor and you wanted to register an opcode for yourself, you had to come to us and say, "We're going to be using this." You didn't have to tell us what you were doing in there, but you had to register it at least to avoid collisions with other vendors using the same opcode. That's no longer the case, because what we're trying to do is allow flexibility in defining new operations. By doing that, vendors can leverage the same software and API without rewriting a lot of the infrastructure code. The new scheme requires them to just define new GUIDs, and then they can define their own set of operations, but those operations still align themselves to the memory structures that we defined earlier on. So if you are a vendor who is implementing 1.0 with vendor-defined op encodings, that part may get deprecated. We did a casual survey around this, and we haven't found specific concerns yet.

The other thing that is likely going to get deprecated is that in 1.0, we allowed for mailbox-style communication. So, for example, if you are a producer in one address space, you can send a mailbox notification to a target in another address space. Now, if you are within the host and you want to do mailboxes, that's probably okay. But if SDXI were to expand to use cases that involve host-to-host, then sending mailboxes in this manner becomes very, very kludgy. And even from a hardware implementation point of view, we received feedback that implementing mailboxes for n address spaces and being able to ping-pong into each of them isn't really the right thing to do. So we're walking this back. It was an optional feature anyway, and the community is now coming back with feedback saying that mailboxes aren't really required. But if you are affected, please come and tell us or yell right now. Okay, now let's get to the interesting part, right?

So, in 1.0, we did something where you can make a copy, right? Now we are getting use cases saying, “Hey, can you make me one more copy at the same time?” That's what we call the double copy. It doesn’t have to be as complicated an example as this one, but say you have a producer in one address space that’s saying, “Hey, I have data in my source buffer here, but I want you to make copies into this address space and this address space.” It's like replication: you’re trying to mirror it, or you're trying to make a local-node copy and a remote-node copy at the same time. This is an example of that. The source would be in address space two, but the destinations could be in address space three and address space four. That’s the power of doing address space to address space data movement in an architected way: we can keep expanding the use cases. Now, somebody may ask me, “Why not do a one-to-many kind of model here?” We did talk about that. This particular operation is synchronous, which means it will not complete until the copy is made to both locations. But if you have something like a one-to-many model where there may be publishers and subscribers, it’s not necessary that everyone receives a copy; only some of them may. So that can be another extension that happens in the future. At the moment, the members that prioritized this wanted it a lot more than the one-to-many use case. But if you have one-to-many use cases, please do come and talk to us. It does mean that you need to define a new operation for it and then work through the proposal on how to make it happen. But this serves as an example of that.
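The synchronous nature of the double copy can be sketched in a few lines. This is only a model of the completion semantics described in the talk; the function name and the boolean return standing in for a completion status are assumptions, and the actual 1.1 operation is still being defined.

```python
def double_copy(src: bytes, dst_a: bytearray, dst_b: bytearray) -> bool:
    """Copy src into both destinations; report completion only if both copies are made.

    Both destinations are validated before either is written, so a failure
    never leaves exactly one of the two copies updated.
    """
    n = len(src)
    if len(dst_a) < n or len(dst_b) < n:
        return False
    dst_a[:n] = src
    dst_b[:n] = src
    return True
```

A one-to-many publish/subscribe variant would differ exactly here: it would be allowed to report completion even if only some destinations received the data.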

So the next one is protecting data integrity. Many of our members told us that, you know, as you move the data, I'm also very interested in ensuring that, you know, you're doing it safely and the data is, you know, there is no corruption in the data. One of the easiest ways to do that is to perform CRC calculations on the data and then provide that along with the data being copied. And we are doing that part. But also, for a lot of the storage use cases, having protection information, PI information, is very important. For example, if you're doing memory-to-memory, and then you might want to check on the memory that you are doing the copy on and to make sure that the PI on that memory is correct. Or maybe you don't have PI on that memory and you want to insert PI into that data. So when you make the copy, now you're inserting the PI into that data as well. You may want to update what is out there because you're making some changes. Now, somebody asked me previously, you know, whether, you know, if you can't trust the CRC on the source information, why would you update it in the destination site? So truth is, with PI, it's not always CRC, the guards that are being checked. It's also the ref tags that are getting updated or the app tags that are getting updated. So you might be wanting to do that as you're moving the data from, like, the source location to a destination buffer location. Or, for example, if you wanted just to compare between two buffers with their own PIs and try to make sure which of those buffers are actually, you know, the right ones. Because if the PI doesn't match, you can't trust the data along with it. So these are the kind of example memory-based operations that you can do with this, and we are looking to do this as part of 1.1.
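The simplest of the integrity ideas above, computing a CRC while copying and handing it over with the data, can be sketched with Python's standard `zlib.crc32`. This is a stand-in: CRC-32 is used only because it is in the standard library, the function names are hypothetical, and storage-style protection information (guard, app tag, ref tag) is not modeled here.

```python
import zlib

def copy_with_crc(src: bytes, dst: bytearray) -> int:
    """Copy src into dst and return a CRC-32 computed over the source data."""
    dst[:len(src)] = src
    return zlib.crc32(src)

def verify_copy(dst: bytearray, length: int, expected_crc: int) -> bool:
    """Recompute the CRC over the destination and compare with the one supplied."""
    return zlib.crc32(bytes(dst[:length])) == expected_crc
```

A consumer that receives the buffer together with the CRC can call `verify_copy` before trusting the data, which is the "if the check doesn't match, you can't trust the data along with it" rule from the talk in miniature.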

New data mover operations, which I've already talked about.

So one of the most common sets that we are looking at is POSIX-style memory operations, whether it's a memory fill or a memory compare, all of those things that you actually do in software today. Imagine doing it for a terabyte of memory. I think there was an earlier question on whether SDXI works on large amounts of memory. One of the simplest use cases that one of the members is interested in is: I have a virtual machine, and I want to fill that virtual machine's memory with zeros. It's a terabyte of memory, and I cannot spend all my CPU cycles just filling it with zeros. That's where this accelerator starts shining for you. Those are the kind of POSIX-style memory operations that you would want to offload to an accelerator. Compression is another use case. We talked at length about it this morning. And then there are some who are likely to build encryption, or their own compression algorithms, and they may not want to standardize them because it's not a common use case. Even then, they can get a GUID for themselves, craft their own opcodes under it, make that part of the definable operations group, leverage the same software that we've been working on, and offer it to their customers. If it gets more popular and there are collisions, that's when you want to come and standardize anyway, and that is the kind of implementation this group is now enabling.
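One way to picture why a GUID removes the need to register opcodes centrally: if operations are looked up by the pair (group GUID, opcode), two vendors can both use opcode 1 without colliding. The registry, the example domains used to derive the GUIDs, and the fill and compare handlers below are all hypothetical; none of this is the definable operations group format itself.

```python
import uuid

# Hypothetical vendor group GUIDs, derived deterministically for the example.
VENDOR_A = uuid.uuid5(uuid.NAMESPACE_DNS, "vendor-a.example")
VENDOR_B = uuid.uuid5(uuid.NAMESPACE_DNS, "vendor-b.example")

registry = {}

def register(group_guid: uuid.UUID, opcode: int, handler) -> None:
    """Register a handler under (GUID, opcode); duplicates within one GUID are rejected."""
    key = (group_guid, opcode)
    if key in registry:
        raise ValueError(f"opcode {opcode:#x} already defined for {group_guid}")
    registry[key] = handler

def dispatch(group_guid: uuid.UUID, opcode: int, *args):
    """Look up and run the operation a descriptor names by GUID and opcode."""
    return registry[(group_guid, opcode)](*args)

def fill(buf: bytearray, value: int) -> bytearray:
    """memset-style fill of the whole buffer."""
    buf[:] = bytes([value]) * len(buf)
    return buf

def compare(a: bytes, b: bytes) -> bool:
    """memcmp-style equality check."""
    return bytes(a) == bytes(b)

# Both vendors pick opcode 0x01; the GUID keeps them apart.
register(VENDOR_A, 0x01, fill)
register(VENDOR_B, 0x01, compare)
```

The same dispatch path serves both vendors, which is the "leverage the same software" point: only the GUID-scoped table of operations changes.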

Some more ordering improvements. For example, in 1.0, we allowed a flag in the descriptor that said, “Okay, all the writes on this descriptor will follow the writes from the previous descriptor.” That’s the write-after-write ordering. In the same way, you can say that all the operations this descriptor does need to follow the completions from the previous descriptor. That’s the kind of synchronization that we allowed in 1.0. With 1.1, we are now also allowing reads to be done after the previous writes have been flushed to global memory. That’s one of the ways that you can do your ordering without having to fully wait for your operations to complete.

I'll brush through some of these things, but just to give you an idea, this is a point of view of how we think a connection manager will help us. We are just putting together the proposals on how a connection manager will broker a connection between different address spaces, so that we can start doing address space to address space data movement. Somebody may ask me, how does this work with CXL fabric managers and so on? Potentially, a connection manager would be a component of something like a CXL fabric manager, and you can make this happen so that you can do data movement between address spaces that are owned by maybe two or more different hosts.

This is an example of a point of view of how a CXL-based implementation can exist. The picture on the left is an implementation of a device which is a PCIe-based device, or a device which exists within the CPU's package but still has a PCIe interface to it. It can see CXL expanded memory and move data between those memory tiers. So it's relatively simple. It's not very different from what you would expect an architected device to do. The picture on the right is a little bit more involved, meaning the SDXI device is itself in a CXL device. It may also have memory attached to it, and it may be able to move data between its device memory and the CPU-attached memory. Both of these models are possible for CXL-based implementations.

This is a picture that was shown previously on how computational storage, NVMe, and SDXI flow together with the different data modes.

What about AI? I mean, there's tons of applications of AI, whether it's doing, you know, all your tensors are in memory, all your manipulations, format conversions can happen in memory, so they have lots of use cases.

Ecosystem is developing. We have the libsdxi library that the workgroup is working towards.

We are also enabling the Linux kernel driver, and then we have lots of use cases that applications can build on top of these.

As far as proof points go, this was shown at MemCon earlier this year. This is an example of a prototype implementation that one of the member companies displayed.

So in summary, we're doing this data movement standard; 1.0 is already released, and we have multiple companies involved in the effort. So, if you'd like to learn more, participate more, or join the group, please come talk to us. Thank you.