

I'm here to talk to you about the Intel technology innovations and the work that we are driving at OCP.

So some obligatory notice and disclaimers, I'll skip past that.

There are about 35 Intel sessions at OCP this year, last time I counted.  With 23 minutes, I'm not going to cover all of them.  So I'm going to touch on a select few topics here: sustainability, immersion cooling, RAS API, CXL memory, and some of the firmware enhancements that we are working on, to give you a flavor of the kind of innovations that we are driving in this space.

Before I start there, I want to look at the trends in the industry that all of us, you and I, work in.  Overall, the global computing market is expected to be close to a trillion dollars by 2025, a couple of years from now.  This is everything, not just hardware.  And the service providers alone are going to reach on the order of $150 billion or so in 2027.  It's a pretty good 10% CAGR.  And the amount of data that we have all been generating as humans is going to be, in a couple of years, something like 180 zettabytes.  Not all of it gets stored.  Only about 2% of the data that you produce ever gets stored.  If you do that math, it's about 4 zettabytes, which is 4 million petabytes.  I think this morning somebody was talking about a petascale SSD.  At that rate, it will take you 100,000 racks to store all of that data.  That's just to give you a sense of scale.  And then the data center market itself is still growing, again at 11% CAGR, to reach around 600 billion by 2030.  So all healthy; it's a good place to begin, is all I'm trying to tell you.
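
The storage arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 180 ZB, 2%, and 100,000-rack figures are the ones quoted in the talk, and the per-rack capacity is simply what those numbers imply, not a figure the speaker stated.

```python
ZB_TO_PB = 1_000_000  # 1 zettabyte = one million petabytes

data_generated_zb = 180   # figure quoted in the talk
stored_fraction = 0.02    # "about 2%" from the talk

# 180 ZB * 2% = 3.6 ZB, which the talk rounds up to about 4 ZB
stored_zb = data_generated_zb * stored_fraction
stored_pb_rounded = 4 * ZB_TO_PB  # 4 million PB

racks = 100_000           # rack count quoted in the talk
# Capacity each rack would need to hold (implied, not stated in the talk)
pb_per_rack = stored_pb_rounded / racks

print(f"stored: {stored_zb:.1f} ZB")
print(f"implied capacity per rack: {pb_per_rack:.0f} PB")
```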

So first topic I want to touch on is sustainability.

So essentially I want to talk about where we are going with respect to sustainability.  And let me just build this slide in quickly so I can talk to it.  Sustainability in general requires a system approach.  To get to a full solution, it has to span silicon, platform, rack, and data center, including both power and cooling infrastructure.  And in the past I've talked about all the touch points and what we need to do in each of these spaces.  For today, I'm going to touch on power management, the disaggregated platform, and immersion cooling in this talk.

And when it comes to sustainability, there are two parts, right?  One is op-ex, right?  This is the power that you consume when your data center runs.  That's the operational power, which is like 90% of the total today, if you plot this.  And then the other, about 10%, is the embodied carbon, which comes as part of your platform.  But if you look at things like the disaggregated platform that I'm showing on the screen, it's more about the embodied carbon.  So why am I focusing on a 10% problem?  Because there is a push in the industry to go towards green sources of energy.  Once you do that, the operational problem will start shrinking.  So we want to go where the puck is going, right?  Not look at the problem as it is today, but where the problem will be tomorrow.  Tomorrow's problem will be that embodied carbon will dominate.  So we want to start solving that problem.  If you look at it that way, embodied carbon is the number one issue that you want to go solve.  So you take a platform and you say, what's the major source of the embodied carbon?  One of the major sources of embodied carbon in the platform is the PCB itself.  It's about 37% if you look at some of the analysis that's out there.  So how do you solve the problem?  You go from a platform that is fully monolithic, that I showed in the previous slide, to something that's fully disaggregated, like we are trying to do with DCMHS and what we call Blue Glacier, right?  So the power supply, the fans, the networking, the storage, and even the manageability, everything gets disaggregated.  So what you call a platform then becomes your CPU and memory, right?  And then all these other components can live for multiple generations.  
That gives you your sustainability value, because you're not throwing away those motherboards and having to take on the carbon cost of those things having to be recycled.  And then they don't all get recycled.  They're all FR4 based today, right?  So that's one of the ways that we're addressing it.  And there is actually an Intel participant in a panel on this on Thursday.  I encourage you to go check it out.

So then next, let's talk about operational carbon since I brought it up, right?  So the chart on the left is showing you what the worldwide consumption from hyperscalers alone is going to be in a couple of years.  It's 250 terawatt hours, right, 250 billion kilowatt hours.  So if these guys were a country, right, if all these data centers were a country, they would be more than Spain, more than Australia.  They would be like 16 in the list of power consumers in the world.  So we are standing here in California, and they would consume more power than the entire state of California consumes in one year, right?  Because California, I mean, California is not that far off.  They're like 240 something.  So it's 250.  So but still, the point is, you know, it is still a pretty high number in the first place.  And same as, I mean, again, the US data center power alone is going to be 35 gigawatts by 2030.  So right now, it's 17 gigawatts.  So we're adding another 18 gigawatts.  So if you look at that number again, that's about another 12 to 15 million servers that are going to get added, right?  So you need to go figure out how to, you know, manage the power of these guys, these servers.
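
As a rough check on the numbers above, the unit conversion and the per-server power implied by "18 gigawatts" and "12 to 15 million servers" can be worked out as follows. The per-server wattage is derived here, not quoted in the talk, so treat it as an inference from the speaker's figures.

```python
def twh_to_kwh(twh: float) -> float:
    """Convert terawatt-hours to kilowatt-hours (1 TWh = 1e9 kWh)."""
    return twh * 1e9

hyperscaler_kwh = twh_to_kwh(250)   # 250 TWh -> 250 billion kWh

# US data center power: 17 GW today, 35 GW by 2030 (figures from the talk)
added_gw = 35 - 17

# Implied average power per added server for the 12M and 15M server estimates
watts_per_server = [added_gw * 1e9 / (millions * 1e6) for millions in (12, 15)]

print(f"{hyperscaler_kwh:.3g} kWh")
print(f"added capacity: {added_gw} GW")
print(f"implied power per server: {watts_per_server} W")
```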

So when you talk power, the metric that folks have used to measure the efficiency of a data center has been PUE, power usage effectiveness, right?  It's just the reciprocal of the IT fraction of your data center energy.  So if 50% of your energy goes to IT, the PUE is 2.  If it's 90%, the PUE is 1.1.  So what happens is, if you look at the chart on the right, the global average PUE has been flattening.  It started out at a pretty awful number, like 2 point something, right, which means you're not that good at utilizing your power.  And it came steadily down through the 2010s, to the credit of the world and the industry, I should say.  But it's kind of flattening around the 1.5 range.  This is the worldwide average, right?  And if you just take the hyperscale average, just the handful of hyperscalers out there, they're hovering around 1.1.  I checked the data before I came here, and they're all around 1.1, give or take.  And if you deploy the latest cooling solutions, like immersion cooling or cold plate, then you're going to be even closer to 1, right, less than 1.1, like 1.07, 1.03, that kind of number.  Now, if the best you can ever be is 1, and your numbers are getting so close to 1, what do you do with this metric at this point?  So how exactly are you measuring this, right?  So where do you go from here?
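
The "reciprocal of the IT fraction" relationship is easy to state in code. A minimal sketch, where the function names are illustrative only:

```python
def pue(it_fraction: float) -> float:
    """PUE as the reciprocal of the share of facility energy that reaches IT."""
    if not 0 < it_fraction <= 1:
        raise ValueError("it_fraction must be in (0, 1]")
    return 1.0 / it_fraction

def it_fraction(pue_value: float) -> float:
    """Inverse relation: share of facility energy reaching IT for a given PUE."""
    if pue_value < 1:
        raise ValueError("PUE cannot be below 1")
    return 1.0 / pue_value

print(pue(0.5))                      # the 50% example from the talk
print(round(pue(0.9), 2))            # the 90% example, about 1.1
print(round(it_fraction(1.5), 3))    # worldwide average: roughly two thirds reaches IT
```

Because the IT fraction can never exceed 100%, the metric bottoms out at 1, which is why values like 1.07 or 1.03 leave so little room for further improvement.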

So one of the things that we have proposed is to look at the problem differently.  So you look at it as: I put 100 watts into the system.  How many watts actually went to useful workload?  And how many watts just got wasted along the path?  When you look at PUE, you're just looking at how much the data center wasted, right?  But it's not just the data center.  What happened to the rest of it?  So if I put 100 watts in with a PUE of 1.1, you're really going to get 57 watts out, right?  Because there is AC-to-DC loss, DC-to-DC loss.  There are fans, there is idle power on the silicon.  You subtract all those things, you come to a much smaller number.  And let's say we come to this magical PUE of 1.  You're still going to be in the 63 to 65 watts for every 100 watts you put in.  So this is where, and I think Murugasamy has a session on Thursday talking about this more, what we are proposing is IUE, Infrastructure Usage Effectiveness, which essentially looks at: I supplied you X amount of power.  How much of that power actually went to power the silicon, as opposed to all these losses in between, right?  And the losses include the idle losses, which go to silicon just being powered and not towards running an active load, right?  So it's not just looking at the motherboard, it's also looking at the silicon effectiveness as well.  So this is another way we're trying to answer the question: when PUE is 1, where do you go from here?  You go towards something like this, which moves a little bit closer to the IT and starts turning the mirror towards IT itself and says, how well are we doing with respect to these efficiency metrics so we can be better at sustainability?  And of course, I encourage you to check out the Intel session on Thursday.
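
The 57-watt and 63-to-65-watt figures are consistent with each other, which a short calculation shows. This is a sketch under an assumption: the talk does not give a formula for IUE, so the code below only splits the path into a facility stage (PUE) and a lumped IT-side stage, and the name `it_side_efficiency` is hypothetical.

```python
def watts_reaching_it(input_watts: float, pue: float) -> float:
    """Power left after facility overhead (cooling, distribution) is removed."""
    return input_watts / pue

# Figures quoted in the talk: 100 W in at PUE 1.1 yields about 57 W of useful work.
input_watts = 100.0
useful_at_pue_1_1 = 57.0

# Lumped IT-side efficiency implied by those figures. It covers AC-to-DC loss,
# DC-to-DC loss, fans, and idle silicon power; the split between them is not given.
it_side_efficiency = useful_at_pue_1_1 / watts_reaching_it(input_watts, 1.1)

# With a perfect PUE of 1, the same IT-side losses still apply.
useful_at_pue_1 = watts_reaching_it(input_watts, 1.0) * it_side_efficiency

print(f"implied IT-side efficiency: {it_side_efficiency:.3f}")
print(f"useful watts at PUE 1.0: {useful_at_pue_1:.1f}")
```

The result lands at roughly 63 W, the low end of the range the speaker quotes, so driving PUE from 1.1 to 1 recovers only a few watts per hundred while the IT-side losses stay untouched.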

So some of the keynotes and talks today covered this: one of the reasons that immersion cooling, and liquid cooling for that matter, has been attractive is that you get to save operational power by up to 30% by going there.  And immersion cooling is one way to do that.  But you always look for headroom, right?  Being able to cool the silicon of today is great, but being able to cool the silicon of tomorrow is what's going to keep us comfortable in that space.  So we've always challenged all of our partners to get there.  And if you go to the Intel booth, which is at B12 in the expo hall, you will see that we have a demo there that shows up to 1,000 watts of cooling with single phase.  And I would like to thank Submer and Shell, our partners, for getting us there.  And this is a significant accomplishment, because this is a key way to get us to the thousands of watts of cooling that we need as silicon and platforms go towards that space, with both compute and AI, in the future.  And one of the problems I want to touch on is also manageability, right?  So we needed a framework for being able to manage all these emerging solutions in a common way, right?  So all this immersion hardware.

So I'm going to pick up the pace because I'm running out of time.  So if you look at immersion cooling infrastructure, and I'm picking on immersion because other solutions are similar; if you look at a cold plate, you can model it similarly.  There's a tank.  There's a bunch of servers.  There is a coolant distribution unit.  There is a heat exchanger.  And then it goes to the data center facility cooling from there.  And there could be one or more CDUs.  A CDU could be co-located in the tank, or it could be outside the tank.  And there's a bunch of sensors located inside the tank that are connected to the CDU.  And how do you manage them?  These are the things that you are trying to look at.

So there was no manageability solution that did that.  So what we did was we basically created a Redfish model, on top of what Redfish already had called thermal equipment, to take any immersion cooling design and be able to comprehend it using Redfish, which is used widely for server manageability.  Because it's a logical extension of how you manage a server to go manage the elements of cooling and power that are now moving from the data center space into the IT space.  So we modeled the redundancy, because you could have multiple coolant distribution units.  You're not going to put in hundreds of kilowatts of equipment and say, I'm going to have a single CDU that's a single point of failure.  So comprehending that is one thing.  And then also, where the sensors are and where the modeling actually sits are two different places.  And we also comprehended that in the Redfish model.  And I don't have time to go into all of it, but we'll be talking about it in the session on Thursday so we can get into it.  The goal is to get this into a space where this is not so much a DCIM thing, but something that you can comprehend as part of standard manageability.  So you're not looking at esoteric management solutions just to figure out how to manage this thing.  Because there is this blurring that's happening, right?  There used to be a very clear boundary between what is data center and what is IT.  Now, as the cooling moves into the IT space, we need to have solutions like this that are standardized, which is the reason why we are doing this.  And while I'm on this, I would like to thank the DMTF's Redfish Forum, OCP's hardware manageability folks, and also John Bean from Green Revolution Cooling, I don't know if he's in the room, who has been a great sounding board for all the various technologies as I send them.  
Is this configuration valid?  And he patiently has always answered my questions.  Thank you very much for that.
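
To make the two modeling points concrete (redundant CDUs, and sensors that sit in the tank but are reported by a CDU), here is a minimal sketch in Python. It is not the Redfish schema: every class and field name below is hypothetical, chosen only to illustrate the kind of relationships such a model has to capture.

```python
from dataclasses import dataclass, field

@dataclass
class Sensor:
    name: str
    located_in: str    # where the sensor physically sits, e.g. the tank
    reported_by: str   # which controller exposes the reading, e.g. a CDU

@dataclass
class CDU:
    name: str
    capacity_kw: float
    healthy: bool = True

@dataclass
class ImmersionTank:
    name: str
    heat_load_kw: float
    cdus: list = field(default_factory=list)
    sensors: list = field(default_factory=list)

    def survives_single_cdu_failure(self) -> bool:
        """True if the remaining CDUs can carry the load after any one CDU fails."""
        healthy = [c for c in self.cdus if c.healthy]
        if len(healthy) < 2:
            return False  # a lone CDU is a single point of failure
        total = sum(c.capacity_kw for c in healthy)
        return all(total - c.capacity_kw >= self.heat_load_kw for c in healthy)

tank = ImmersionTank(
    name="tank-1",
    heat_load_kw=80.0,
    cdus=[CDU("cdu-a", 100.0), CDU("cdu-b", 100.0)],
    sensors=[Sensor("coolant-temp", located_in="tank-1", reported_by="cdu-a")],
)
print(tank.survives_single_cdu_failure())
```

Keeping `located_in` and `reported_by` as separate fields mirrors the point in the talk that the place a sensor lives and the place it is modeled are not the same.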

I'm going to switch gears and talk to you about RAS.

This is another key area, right?  So depending on your viewpoint, the elegance or ugliness that you see in a RAS hardware implementation today is 20 years in the making, right?  Twenty years ago, when we started doing this, we only had access to registers.  And so all of RAS was implemented using registers.  And it continues to be the case today, right?  And there is so much hardware variance today, right?  There are CPUs, GPUs.  There are CPUs from multiple vendors.  There is no consistency.  And also, there is only one way, which is in-band.  Between in-band and out-of-band, there was no consistency either.  And also, the way hardware is designed has changed from a monolithic to an IP-based architecture.  So the most common thing that platforms do is to assist the RAS stuff through firmware, right?  There is no easy way to do this through hardware itself.  So we have been working with OCP to create a RAS API.  So we go from a hardware, register-based interface to an API-based interface, where you make a call and you get an answer.  That completely isolates you from what the hardware is doing behind you.  So somebody can collect all that data, put it in a standard format, and give it to you.  And that isolation means you don't always have to worry about the hardware, and it also gives you much richer information than what you would get in a single register file.  And also, it takes you away from implementation-specific detail.  Like today, if you want to go look at a machine check bank, it doesn't matter which vendor.  It doesn't have to be Intel, right?  You have to know what processor you're on.  First thing you have to do is to run CPUID, figure out what it is, and then you can assign meaning to what you're seeing.  And you can just abstract all these things away, right?  So it's a great effort, and it's not just ours.  It's a community effort.  
And again, we have a session tomorrow, in fact, and I encourage you to check that one out as well.
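
The shift from register decoding to "make a call and get an answer" can be illustrated with a small sketch. Nothing here reflects the actual OCP RAS API, whose interface is not described in the talk: the record format, the backend classes, and the register bit layout are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class ErrorRecord:
    """Hypothetical vendor-neutral error record."""
    component: str
    severity: str
    description: str

class RasBackend(Protocol):
    def collect(self) -> list: ...

class VendorACpu:
    """Hypothetical backend that decodes a made-up status register layout."""
    def __init__(self, status_register: int) -> None:
        self._status = status_register

    def collect(self) -> list:
        records = []
        if self._status & 0x1:  # bit 0: corrected error (illustrative layout)
            records.append(ErrorRecord("cpu0", "corrected", "cache ECC error"))
        if self._status & 0x2:  # bit 1: uncorrected error (illustrative layout)
            records.append(ErrorRecord("cpu0", "uncorrected", "memory controller error"))
        return records

class VendorBGpu:
    """Hypothetical backend whose device reports fault strings, not registers."""
    def __init__(self, fault_codes: list) -> None:
        self._codes = fault_codes

    def collect(self) -> list:
        return [ErrorRecord("gpu0", "uncorrected", code) for code in self._codes]

def get_errors(backends: list) -> list:
    """The caller sees one format and never touches vendor registers."""
    return [record for backend in backends for record in backend.collect()]

errors = get_errors([VendorACpu(0b01), VendorBGpu(["link fault"])])
for record in errors:
    print(record)
```

The caller never needs to identify the processor before interpreting the data; that knowledge lives inside each backend, which is the isolation the speaker describes.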

And then a couple of things on the CXL side.  There's a lot of stuff that the CXL Consortium has gone out and solved, to their credit.  And as CXL 2.0 and 3.0 fabric topologies emerge, memory truly becomes not just something that's locally attached, or attached a little farther away, far memory versus near.  It is, in fact, fabric attached.  Then you really need something that's more of a fabric model, right?  You need to be able to say, hey, what are the attributes of your memory?  When it was locally attached, you could say, what's your memory's ECC, right?  You knew exactly what the ECC was, because the memory controller is the one that implemented it.  But when it's fabric memory, you know none of those things.  You don't know what ECC it has.  You don't know what RAS capabilities it has.  You don't know whether it implements encryption.  You know nothing about it, right?  So you need an intermediary that's mediating between you and the memory, so you can say, I want this memory with this attribute, and somebody picks it up and gives it to you.  And then if you start sharing memory, it gets even more complicated, right?  Am I allowed to share this memory?  How many people are allowed to share this memory?  When can I share this memory?  When I release the memory, am I supposed to zero this memory?  All of this stuff is in the scope of the data center fabric manager.  Again, this is part of the CMS work in OCP, but we are going to actually do some of this work and then figure out the place to contribute this into the industry, because this is a key problem you need to solve before memory fabric becomes a reality.  Otherwise the whole adoption is going to end up lagging behind.  So this is why this is a critical problem for us to get into.
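
The mediation role described above can be sketched as a toy allocator that matches requests against memory attributes and enforces sharing and zero-on-release policy. This is an illustration only: the class names, fields, and policy rules are assumptions, and none of it reflects an actual CXL or OCP fabric manager interface.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MemoryRegion:
    name: str
    size_gb: int
    ecc: bool
    encrypted: bool
    max_sharers: int           # how many hosts may share this region
    zero_on_release: bool      # whether contents must be scrubbed when freed
    owners: set = field(default_factory=set)
    scrub_count: int = 0       # stand-in for actually zeroing the memory

class FabricManager:
    """Hypothetical intermediary between hosts and fabric-attached memory."""
    def __init__(self, regions: list) -> None:
        self._regions = regions

    def request(self, host: str, size_gb: int,
                ecc: bool = False, encrypted: bool = False) -> Optional[MemoryRegion]:
        for region in self._regions:
            if (region.size_gb >= size_gb
                    and (region.ecc or not ecc)
                    and (region.encrypted or not encrypted)
                    and len(region.owners) < region.max_sharers):
                region.owners.add(host)
                return region
        return None  # nothing on the fabric satisfies the requested attributes

    def release(self, host: str, region: MemoryRegion) -> None:
        region.owners.discard(host)
        if not region.owners and region.zero_on_release:
            region.scrub_count += 1  # a real system would zero the memory here

fm = FabricManager([
    MemoryRegion("pool-a", 256, ecc=False, encrypted=False, max_sharers=4, zero_on_release=False),
    MemoryRegion("pool-b", 128, ecc=True, encrypted=True, max_sharers=1, zero_on_release=True),
])
granted = fm.request("host-1", 64, ecc=True)
denied = fm.request("host-2", 64, ecc=True)   # pool-b allows only one sharer
print(granted.name, denied)
```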

So lastly, I'm going to talk to you about a couple of firmware innovations that we are working on.  This is in conjunction with the OCP security project.  So one of the problems you've always had is, is your firmware secure?  So we say, okay, go to the platform root of trust, measure the firmware, okay, great.  And you get the software BOM from the vendor, and the vendor BOM says X.  But the problem is that there is no way to tie together the measurements that come from the platform, the BOM that comes from the vendor, and the audit information that comes from an independent auditor saying that all of this was written in a way that doesn't violate anything: I'm signing this to say that there are no security gaps here, and here is the CVSS score on this thing.  And for the first time, we have put together a demonstration vehicle, to show at the Intel booth, of how the SBOM is taken with the audit trail and put together.  And then we use an SPDM-based mechanism to say, yep, we trust this firmware, and then you admit the system into your network at that point.  And we will work with the OCP security project to make this part of the solution there.  What we're trying to do is to create this audit mechanism, because it's not one company, right?  It needs to be some third-party independent auditor that says, I looked at this firmware and this works.  And again, we'd like to thank the folks that partnered with us on this project, Microsoft and AMI, for getting us there in record time.  And please check out the demo at the Intel booth.
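
The three-way check described above (platform measurement, vendor SBOM, independent audit) reduces to a simple admission rule. The sketch below is a simplification under stated assumptions: it uses SHA-256 digests as stand-ins for measurements, omits signature verification and the SPDM transport entirely, and all function names are hypothetical.

```python
import hashlib

def measure(firmware_image: bytes) -> str:
    """Stand-in for a root-of-trust measurement: a SHA-256 digest of the image."""
    return hashlib.sha256(firmware_image).hexdigest()

def admit(platform_measurement: str, sbom_digest: str, audited_digests: set) -> bool:
    """Admit the system only if all three sources agree on the same firmware.

    platform_measurement: what the platform reports it is running
    sbom_digest:          what the vendor's SBOM says was shipped
    audited_digests:      firmware builds an independent auditor has signed off on
    """
    return platform_measurement == sbom_digest and platform_measurement in audited_digests

shipped = b"firmware build 1.2.3"
tampered = b"firmware build 1.2.3 with implant"

sbom_digest = measure(shipped)
audited = {measure(shipped)}

print(admit(measure(shipped), sbom_digest, audited))    # matches SBOM and audit
print(admit(measure(tampered), sbom_digest, audited))   # measurement differs
```

In a real flow each input would arrive signed and would be verified before comparison; the point here is only that admission depends on the three sources agreeing, not on any one of them alone.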

And lastly, one of the things that has always bothered me, because it's been-- ever since the age of the IBM PC, the system has booted off of a flash.  And I've never liked that in a modern world where there are hundreds of thousands or even millions of servers out there, because it's a maintenance nightmare, right?  You've got to go maintain all these things together.  So why is it that everything has a flash, and I have to load up the image in the flash, and then I have to hit a reset, and then you've got to load?  And this is something that has bothered us for the longest time.  And we are happy to say that we were actually able to solve this problem.  So essentially, what we did was we took out the BIOS flash completely, took the BIOS image and put it into the memory of the BMC, and programmed the system so it thinks that it's accessing BIOS, but it's really not accessing it from the flash.  And by doing that, you solved a lot of problems.  Your provisioning problems are a thing of the past, because you can provision dynamically.  It doesn't have to be there ahead of time.  You can show up when you need it, which means you can have an image server sitting in the back end somewhere that's supplying hundreds of nodes at the same time.  You don't have to go, here's your BIOS, here's your BIOS, and keep updating.  And I'm picking BIOS.  Same thing applies to any firmware that you're talking about, right?  All the image limitations are gone.  All the speed of update issues are gone, right?  So essentially, the manageability problem that you have in this large-scale data center world, we're trying to solve that problem by adopting a much different approach that, in our opinion, is long overdue.  Again, we welcome you to come check out that demo as well at the Intel booth.

With that, sustainability is not a one-person or even a one-group problem.  I personally lead as one of the leads for sustainability, but still I would say that it touches many, many disciplines, even at OCP, right?  The server project, the data center folks, of course, sustainability project, and even hardware management.  Many, many folks are involved and engage with all of us because it is multidisciplinary in nature.  It takes all of us to go solve it.  And of course, if you're interested in the Fabric Manager on CXL, please engage with Intel.  We would love to work with you.  And anything that we presented on firmware innovations, if it interests you, do talk to us.  And do stop by our booth.  I apologize, it's B12, not A12.  On the booth, check out our demos on immersion cooling to 1,000 watts.  Actually, I think our processor, it's currently showing around 940-something watts.  Our data center, DCMHS system, and the flash innovations I talked about, all of those demos are running in our booth, and we welcome you to check it out.  I think I'm down to my last minute.  If you have questions, I'm happy to answer them, or I'm happy to stick around outside and answer them as well.  Thank you very much for your attention.

I have a question about your IUE calculation.

Yeah.

It seems like if you had a board in the system, and you did the calculation, and then you had a system that's running at 400 to 650, and the energy going in, and the CPU power is at 400, and you have a system manufacturer, do you have operating power for the higher --

Yeah.  Let me paraphrase his question.  His question is, can you game the IUE?  PUE can be gamed the same way, by the way, right?  You know that, that PUE can get gamed the same way.  For IUE, the part that -- it's actually at load.  It's not just you can measure it any time.  You want to put a -- it's at a specific load.  In fact, one of the debates that we are having in sustainability project is what that load number is.  In my mind, it should be around like 50%.  And the reason I say 50% is because if you look at an overall data center utilization curve, it's between 40 and 60, because most of the data centers are there.  And that's where you want to measure it.  And then you stop that gaming, because today, like otherwise, PUE, essentially you can say, I'm going to crank up the load to 100%.  And then your PUE gets better just because you're measuring at a higher wattage.  I absolutely agree with you.  We are going to make sure that doesn't happen with IUE.

On your modular server design, can you comment on the cost? Or is that going to be an impediment, say, for your customers?

The question was on the cost of the modular servers?  I'm the technologist, so I don't know; I don't play in the cost space.  But because you're amortizing it over a longer period, I would think overall the cost will go down, not go up.  Right?  I mean, maybe up front, you're paying more because you have more connectors involved.  But if you do a total cost of ownership analysis, I think you will come out ahead.  Yeah, but I'm happy to go through those numbers with you.