

Great, well, thank you. I think we're right on time, just about on time, so I'll get right to it. By the way, I saw a lot of you taking pictures. Feel free to take pictures, but really, if you just want the presentation, just send me your card or give me your card and I'll email it to you. I don't really care. There's nothing classified here, so I'm happy to share. So just a few things. I'm William Koss from Drut Technologies. We are a few-year-old startup located just outside of Boston with offices in Hyderabad, so I have kind of a split global dev team. So a little bit about what we're doing. The core team originally came from a company called Calient Technologies, which was an all-optical switching company. And we have been thinking about how we deploy all-optical switches in the data center. We call that a photonic fabric. There's a number of reasons why we want to go do this. And let me just fix my little screen here. There we go. So when we started building the company, for the first projects we had, we had some capital from a large hyperscaler who had lots of individual single-GPU server machines. And this is pre-COVID. And they said, hey, can you go figure out how to optically disaggregate GPUs and then move those GPUs into machines so we could have multi-GPU machines from single-GPU machines? Can you build that technology? And when I first went to look at the company, I was not one of the original founders. I knew the team, and I had been advising them. I went and saw this technology, and I looked at it, and I thought, I have no idea what you guys are doing. Straight out, I just couldn't figure it out. Like, why would someone want this? And then we started looking at it, and they kind of convinced me to go be CEO. And I joined, and we started working on it. 
And then I realized that, ah, what we really were doing was completely disaggregating the server as you know it today, attaching it all to a photonic fabric, and then allowing all of those resources to be completely, 100% fluid across the fabric architecture. So that's what we were going to go build. And so as we thought about this problem, and when I say a photonic fabric, I mean it could be a very large fabric. It could be just a couple of simple switches, or it could be 50, 60, 100 switches. It could be 15,000 endpoints. It could be a very large three-dimensional torus structure: all optical, all light, no packet switching in the middle of it. So this is what we were building. So when we think about it, there are a number of advantages: primarily power, cabling, performance, latency. All those wonderful things you heard are very true. But I think there's one more significant advantage. And when I talk to customers (I flew in last night; I've been in Europe for a few days doing customer calls), the real advantage is we can move the wires. Straight out, that's the advantage. We can move the wires, and we can create a slice of the fabric architecture around a workload, a work group, any type of grouping that you want to do in the fabric architecture. So what do I mean by we can move the wires? I'm a network guy. For 30 years, I've been selling packet switches and routers and traditional network architectures. When you wire them up in the data center, you wire the server to a switch, to a spine, to a core, and it's done. That's how it works. And yes, you can get very large scale, and you can build three tiers, and you can do all sorts of cool things. But if you wanted to move the server from rack one to the switch in rack 100, you'd have to send somebody down there. And more likely, they're not going to move that. So in our architecture, we build a very large fabric architecture, a three-dimensional torus, an XYZ architecture. 
And we can manipulate the actual fabric around the endpoints. And so that's what we've been building. So this provides a much different architecture than what most people look at. I'm only aware of one company that has significantly deployed something close to this. That would be Google. They have lots of papers you can go read. Their TPU v4 paper is probably the best one to go read. So in the fabric architecture, we thought, well, if we had a fabric that was all light, low latency, and you can move the wires, we could then develop this technology we call slicing. So you can basically slice the fabric around workloads, groupings, clusters, and that's the real advantage.
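
To make the three-dimensional torus idea concrete, here is a minimal sketch of how endpoints in an X by Y by Z torus are adjacent, with wraparound at the edges. The function name and the coordinate scheme are illustrative assumptions, not anything from the Drut or Google implementations.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six wraparound neighbors of `node` in an X x Y x Z torus.

    Hypothetical helper for illustration only. If a dimension has size 2,
    the -1 and +1 steps along that axis land on the same node.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # Modulo arithmetic gives the wraparound link at the cube's faces.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


# A corner node of a 4x4x4 torus still has six neighbors because of wraparound.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

In a torus there are no edge nodes: every endpoint has the same number of links, which is part of why a slice can be placed anywhere in the cube.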

So I kind of walked through this a little bit already. But when you think about why you'd want to have a dynamic optical fabric, it's really around moving the wires. You're data-rate agnostic, you're frame-format agnostic. In our case, the first thing we did was PCIe, so PCIe over the optical fabric. But we can also do CXL. We can do RDMA, which we're going to announce very shortly. And so now all of a sudden, you have a very extensible fabric architecture where you can run Gen 3 PCIe. You can run Gen 5. You can run RDMA. You can run various things in the same kind of fabric architecture, which is very, very useful. And we can put it all under software control. Certainly all-optical switches use significantly less power compared to an electrical packet switch. And the beauty is it's a little bit different than a traditional network architecture. In our case, we can take the workload requirements and imprint them into the fabric. Rather than the fabric kind of working around the workload, we basically take the workload and we can create the fabric architecture that you want.

And the reason-- I'm going to try to stay on time here. The reason why you do this and the reason why I think we need kind of a dynamic fabric  networking architecture is the workloads are just huge. So this is-- I took this off the web, but pretty much you can all go find it. The source is there. So you find workloads are growing. LLMs are growing. Huge amount of things. People are building at very, very large scale. So we think that really what you want to do is how you manipulate that infrastructure  much better. You'd have much, much higher level of efficiency of resource utilization if we can actually  create and group the resources needed for the compute for the LLMs essentially together  as a single group. Think of our architecture. I was in California last week. I presented to one group and a longtime storage person said, "You built a SAN for all the  compute elements."  So essentially a private network that's completely unshared that's a direct connect fabric architecture. That's essentially what we've kind of gone and built.

So as a company, we've been working at this for a while. We're, as I said, almost five years old. The first thing we did was figure out how to disaggregate the server bus and essentially attach it to an optical fabric. So just as the prior speaker said, we did a lot of things in how you move all the sideband signals over an optical fabric. That's very hard. You're going to move that. So we have a lot of FPGA work. We've then gone on and decided how we go essentially do that with RDMA. We have new hardware cards coming out, which we haven't announced yet, which actually integrate a number of co-packaged optics silicon photonics chips, because, as I think an earlier speaker said, it's a very hard problem when you're putting a small form factor card into a PCIe server architecture. You need a lot of bandwidth to come out of that, especially if you want to do a full 16 lanes of Gen 5 with 32 gig SerDes. You've got 512 gig coming out of a small form factor card. I got some news for you. Pluggables don't work. You're out of physical board space to fit into these things. So we had to move to a much more interesting kind of optical technology, which we haven't announced yet, but we will show it at Supercomputing this year. And we had to build, basically, what we call a workload-aware topology. So there's a lot of software stack integration in our work. So in terms of thinking about how we're going to use photonics, use photonic switching, certainly we needed to get better integration of port density. So we see co-packaged optics or near-package optics, however you want to frame it, as super important to us. But we also had to go do the integrations at the workload, so higher-level software systems can actually understand all the endpoints around the fabric architecture and can actually imprint the network topology that they want. 
And when you compose a node and you compose a cluster, which we call a slice, if it's under-resourced you can dynamically add GPUs to it. You can dynamically add NVMe, FPGAs, SmartNICs, DPUs, whatever you're using in your architecture. So putting this all together, it's a true system architecture. So when I go out and talk to people, it's not just people trying to solve specific problems. I'm really talking about people in, certainly, AI and HPC, which I kind of think is a merged market today, but also lots of video and rendering. Those markets are super interesting to us.
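
The 512 gig figure follows from simple arithmetic: 16 lanes at 32 GT/s per lane. A small sketch of that calculation, per direction, is below. The function name is hypothetical; the 128b/130b line encoding is the standard PCIe Gen 3 and later encoding and is added here only to show the slightly lower usable rate.

```python
def pcie_bandwidth_gbps(lanes: int, gt_per_lane: float, encoding: float = 128 / 130):
    """Return (raw, after-encoding) bandwidth in Gb/s for one direction.

    Illustrative helper, not part of any vendor API. The default encoding
    factor assumes 128b/130b, as used by PCIe Gen 3 through Gen 5.
    """
    raw = lanes * gt_per_lane
    return raw, raw * encoding


# A full x16 Gen 5 link: 16 lanes x 32 GT/s.
raw, usable = pcie_bandwidth_gbps(16, 32.0)
print(f"raw: {raw:.0f} Gb/s, after 128b/130b encoding: {usable:.1f} Gb/s")
```

That is roughly half a terabit per second in each direction leaving a single small form factor card, which is the board-space pressure described above.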

And so today we think about where this is all going. Certainly PCIe is dominant in the marketplace, which is why we focused on that. It's huge, right? So when we first announced our product, we had a Gen 3 product, and the market was just transitioning to Gen 4. Gen 4 exploded with A100s, and then within a year, Hoppers were out. I saw last night that NVIDIA delivered the first Blackwell to OpenAI, so you're just seeing an explosion of speed and density in the fabric architecture. So how we put all this stuff over the fabric is really what we've been focused on. So we're going to do RDMA-aware photonics very shortly. It's a mid-year delivery for us, and essentially this allows you to virtualize an existing NIC architecture over the photonic fabric, so we can use off-the-shelf servers, anything with an RDMA NIC architecture. And there's some reasons to go do this. It's not quite as high a scale as a full hardware disaggregation, but it allows you to move very quickly and merge different types of traffic over the photonic fabric. And so these are the things that we're focused on.

I'm just going to try to stay on time here a little bit. So really what we think about (I should probably change the title a little bit here on this slide) is that when we look at building large-scale Ethernet networks, even InfiniBand networks or PCIe switch networks, you just run into a huge power problem. You run into a huge performance problem. You run into a determinism problem where you can't actually craft the fabric architecture around the workloads well enough. And so a lot of my colleagues in the prior world, you know, networking people, they're trying to solve this with spray and pray and 800 gig ports and more silicon, stacking more switch silicon on top of switch silicon. I absolutely think this is not going to work, right? The moment you're out of switch ports or you need more bandwidth in your ports, you get to go buy a new switch, buy a new chip in the switch. This is a problem.

So really when we think about our fabric architecture, what we're really building is quite a large fabric, right? So you don't have to build this day one. You can just build a couple of switches. But if you think about this architecture, it's rate and format agnostic. You can run 100 gig, 200 gig, 400 gig, 800 gig perfectly in the same fabric. It'll scale. You don't have to upgrade the fabric architecture. You can put different frame formats over it. And when you disaggregate the server, this makes a huge change to the TCO, to the operating model. You know, I have customers who have told me, "Oh, I bought 10 racks over here and I didn't put enough memory in the servers, so I bought another 10 racks and I have enough memory. But then I got the GPU wrong, so I bought another 10 racks. So now I have 10 racks with the wrong memory, 10 racks with the wrong GPU. But then I finally got it right, so now I'm really running 30 racks." But you know, it's a mix. It's, you know, only certain racks have the workloads I need. You would really never do this in a disaggregated photonic architecture. What you would do is the server would be disaggregated and you can compose the hardware resources, the resources around the CPU, as needed. And if you needed to upgrade the GPUs, which are on a much faster upgrade cycle than CPUs... I mean, let's face it, people are running CPUs, you know, 6 to 10 years in a server. I mean, 10 years ago, people would say, "Oh, I upgrade my servers every 2.9 years." But nobody does that anymore. They run 6 years plus through the architecture. But they want to upgrade the GPUs much faster. I mean, just go look at the NVIDIA announcements. You just need to scroll through their press releases. Every 18 months, or every 12 months, they have a new GPU that's available, right? So people want to jump to the new GPU. 
Well, if you have to consume the entire thing in a new server every time, this is a problem, right? I mean, unless you've got lots of money, this is a real problem. So in an architecture where you can disaggregate that and you can separate the resources, and this is what we believe, you separate the GPUs into pools, separate storage into pools, and any other specialized processors if you're using FPGAs or, you know, whatever it may be. If you can separate those things into pools, then it gives you the ability to upgrade those pools or portions of those pools completely separately from the actual consumption model today, which is a 1U, 2U, 4U, 8U sheet metal box, right? And so to do all this, though, I'll also say my colleagues in the optical world are amazing, because there's a stack of companies out there that are doing incredible things about putting in higher density, you know, internal lasers, more wavelengths, different types of wavelengths, all sorts of interesting, cool optical things. And we view that as kind of an innovation pool that we're just going to grab and put into our system architecture. So more wavelengths on a fiber is super interesting to us, as is higher-density switching, all those capabilities in the fabric architecture. So all of this allows us to essentially build a very deterministic fabric, meaning we directly wire resources together. We don't go through any intermediate switches. There's absolutely no state in our fabric. It's a direct wiring concept. So we understand the endpoints and we wire things together. That's it. They don't have to go figure out how many hops they have to go through, as with BGP or peering or... there's nothing like that. We don't have to do any of that stuff. It's a direct-connect fabric architecture. And this completely changes how you build your data center. And it gives you a much more interesting long-term TCO that will be much more efficient in your data center.
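
The direct wiring concept can be pictured as a cross-connect table: a port is either dark or connected to exactly one other port, and "moving the wires" is just rewriting that table under software control. The sketch below is a toy model under that assumption. The class and its methods are hypothetical and do not represent Drut's software; the table is configuration held by the controller, not per-packet state in the data path.

```python
class CrossConnect:
    """Toy model of an all-optical switch as a set of port-to-port light paths."""

    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self._peer = {}  # port -> the port it is currently wired to

    def connect(self, a: int, b: int) -> None:
        """Create a direct light path between two free ports."""
        if a == b:
            raise ValueError("cannot connect a port to itself")
        for port in (a, b):
            if not 0 <= port < self.num_ports:
                raise ValueError(f"port {port} out of range")
            if port in self._peer:
                raise ValueError(f"port {port} already in use")
        self._peer[a] = b
        self._peer[b] = a

    def disconnect(self, a: int) -> None:
        """Tear down the light path that port `a` is part of."""
        b = self._peer.pop(a)
        del self._peer[b]

    def peer(self, a: int):
        """Return the port wired to `a`, or None if `a` is dark."""
        return self._peer.get(a)


# "Move the wire": re-home port 0 from port 5 to port 7 without touching a cable.
xc = CrossConnect(8)
xc.connect(0, 5)
xc.disconnect(0)
xc.connect(0, 7)
print(xc.peer(0), xc.peer(5), xc.peer(7))
```

There is no routing decision and no hop count in this model: once two ports are connected, traffic between them sees a point-to-point link.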

And we're not the first company to do this, in fairness. The Google guys did that. And actually, that's where my team came from. So these are pictures out of one of the more recent Google presentations. I mean, you can just go search for it. They've got tons of papers now. You know, this was all super secret five years ago. But now they've gone and fully published the whole thing. It's well known. Now, they started at the network layer. We're not really doing the network layer. We're disaggregating the compute layer, the server layer. But they started up there because that was easiest for them. They have other capabilities that they can go build. But what they show is you can build what's called a three-dimensional torus. So think of it as a cube. It's the same architecture that we have. And then you can attach things to the cube. You can take portions of the cube out of service, upgrade them, put them back into service. And one of the things that we've pioneered in the fabric architecture is a concept we call attach/detach. So we've gone and made modifications to the Linux kernel in the server. So you load our software, and it allows us to essentially transparently connect a PCIe bus to an optical fabric. So when you add resources or you take resources away, like you add GPUs, the server doesn't reboot. It doesn't have to reset. There's a dynamic nature to it. And that gives you a fluidity of resources across the fabric architecture. So we dynamically re-enumerate the bus architecture while the system is running. But this is a proven architecture. I mean, Google's got thousands of switches deployed at this time. This isn't like -- it's in all their major data centers. They talk about how it runs. And they have papers on reliability and all this stuff. I mean, it's super well proven at this point.

And so really our focus is how we evolve to dynamic networks from static networks. And people are still trying to figure out how to make static networks work better. And really we just think of it as a fabric made up of lots of optical switch ports with optics on the edge and all the compute elements. And then what we can do is we can basically call the fabric. And so that's what that kind of shows on the right-hand side. Meaning you can call the whole fabric as a single entity or you can start to call slices of it. You can say I want a 2x4x8, I want a 4x2x4, I want a full 16x16x16. So this is a 4096 cube. So 4096 GPUs in a cube. We'll have this at the end of the year. You can actually build four availability zones of this. So you can put 16,000 in a large cluster architecture. I'm not saying that people are going to go off and do that today. But what you can do is you can start with just a slice of this cube and then build onto the slice. You just continue to build and build. You'll fully use the architecture. It'll have a very long life. You'll put all sorts of different things on it. It'll be very, very super cool for you. All your friends will, you know, think you're just like the coolest person. I'll show it right up.
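
The slice sizes quoted here are just products of the three dimensions. A short sketch of that arithmetic, with a simple check that a requested slice fits inside the full cube, is below. The helper names and the fit rule are illustrative assumptions, not the actual slicing software.

```python
from math import prod

FABRIC = (16, 16, 16)  # the full cube described in the talk


def slice_size(shape):
    """Number of endpoints in an X x Y x Z slice."""
    return prod(shape)


def fits(shape, fabric=FABRIC):
    """True if every slice dimension fits within the fabric's dimension.

    Simplified check for illustration: it ignores which endpoints are
    already allocated to other slices.
    """
    return all(s <= f for s, f in zip(shape, fabric))


for shape in [(2, 4, 8), (4, 2, 4), (16, 16, 16)]:
    print(shape, slice_size(shape), fits(shape))

# Four availability zones of the full cube.
print(4 * slice_size(FABRIC))
```

So a 2x4x8 slice is 64 endpoints, a 4x2x4 slice is 32, the full cube is 4096, and four zones come to 16,384, which matches the "16,000" figure.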

And really the reason why we want to go do this, this is actually from a Meta paper. Actually, it's from an MIT paper, which did some work with Meta. But workloads are actually different. Not all LLMs are the same. Not all workloads need the same type of resources. And so what we can do is, from the software architecture, we can actually harness what's best for the workload, and we can imprint that on the fabric. And that's a very, very different network architecture. That's what we call a slice. So we say, oh, this workload requires blank, and we're actually going to create that in the fabric, and there you go.

So I'm going to get this right on time and finish this thing right off, I think, perfectly here. So just a little bit about, you know, we're supposed to put this in the presentation. I will say that, you know, there's a huge power problem, and the only way you're going to get over the power problem is really through increased amounts of optics. And the problem with optics, though, is it's all light, which is very, very cool, but you can't really read it. You can't buffer it. You can't store it. There's all sorts of problems in actually switching light. And so removing OEO (optical-electrical-optical) conversions and going to straight all-optical connections is where the real power and performance benefit comes from. So that's a wrap on time.