A closer look at power integrity at AI scale

Ask most engineers about power integrity and they’ll tell you the job is simple: get the impedance of the power bus as low as you can. Steve Sandler will tell you that’s exactly backwards. The goal, he says, is to make the impedance as high as the chips can tolerate, at the lowest cost the system can bear. That distinction sounds academic until you put it next to a modern AI accelerator pulling thousands of amps through a core power rail, at which point it becomes the difference between a board that works and one that melts.

Sandler is the founder of Picotest, which builds the test and measurement instruments, and teaches the methods, that engineers use to characterize power distribution networks. He’s been doing power electronics and power integrity work for, by his own count, 50 years, and he spends much of that time inside the labs of the companies building AI silicon. That vantage point is why we wanted to talk to him. The currents and switching speeds inside AI hardware have climbed so fast that the math, the measurement tools, and the design assumptions engineers have relied on for decades are all breaking at once.

The Data Center Engineer sat down with Sandler to talk about what power integrity means when a single chip draws as much current as a small substation, why Bode plots no longer prove what engineers think they prove, and why he tells customers to build early, build often, and expect most of it to fail.

The following is our conversation, lightly edited for length and clarity.

Let’s start at the top. What is power integrity?

Steve Sandler: Power integrity really is the art and science of creating power that chips like. As a really broad-stroke definition, that works well, because not all chips require the same level of power. Some can tolerate more noise, some can tolerate less noise, some are more susceptible to crosstalk and others may be less. The whole idea of power integrity is to control the power in a way that the chips and the system are able to tolerate. And of course there’s this commercial side of it that says at the minimum cost and things like that, but at the high level the goal is really just to create power that the chips like.

I think there are a lot of misconceptions about what it is. There was one engineer I saw on LinkedIn that said, “I don’t understand all this fuss about power integrity. The goal is just to make the impedance as low as you can.” And I said, actually, that’s exactly the opposite of the truth. The idea is to make the impedance as high as we can tolerate. That’s very different. So I think there’s a fundamental misunderstanding about what power integrity really means, what its role is, and how we achieve it.
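
One common way to put a number on “as high as we can tolerate” is the target-impedance rule of thumb: the allowed voltage deviation divided by the worst-case current step. Sandler doesn’t spell out a formula here, so treat the following as a minimal sketch. The rail voltage, ripple budget, and load step are illustrative assumptions, not figures from the interview.

```python
def target_impedance(rail_voltage, ripple_fraction, current_step):
    """Highest flat impedance (ohms) that keeps a load step inside the ripple budget.

    Rule-of-thumb relation: Z_target = allowed ripple voltage / current step.
    """
    allowed_ripple = rail_voltage * ripple_fraction  # volts
    return allowed_ripple / current_step


# Hypothetical example values (assumptions, not from the interview):
# 0.8 V core rail, 5% ripple budget, 2,000 A load step.
z_target = target_impedance(rail_voltage=0.8, ripple_fraction=0.05, current_step=2000.0)
print(f"Target impedance: {z_target * 1e6:.1f} microohms")
```

With these assumed numbers the ceiling works out to 20 microohms. Designing well below that figure adds capacitors and copper without giving the chip anything it needs, which is the cost side of the argument.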

So how are AI workloads changing the demands on power delivery?

Sandler: I don’t think it’s a secret that the AI power levels are increasing at an exponential rate. And the real question is, how does that change the game for us in the world of power? It changes the game in a couple of ways. One is that most power systems today are no longer linear systems. They’re nonlinear, and they’re also not time-invariant. I gave a paper at DesignCon on those topics just in February. And so if we have these systems that are now nonlinear and time-variant, it means that many of the tools we had on hand just basically stopped working.

Bode plots, for example. My entire history, I’ve learned how you make Bode plots and that that’s how you prove systems are stable. But Bode plots have two fundamental requirements: the system needs to be linear, and it needs to be time-invariant. Today it’s neither, and that makes it really complex.

What adds to that is the fact that the workloads have currents that are much higher and data rates that are much faster, meaning that data rates and levels can now hit almost any weakness in the power bus. At one megahertz, maybe it’s pretty rare that you’ll be able to find some perfect load cycle that aligns with some resonance in your power plane. Run at a gigahertz and it’s a thousand times more likely, and run it at 40 gigahertz and it’s 40 times more likely than that. So based on the fact that this current is scaling so much, and so is the time, and the underlying nature of the system is different, everything is kind of being impacted.

By everything, do you mean a traditional power system is struggling too?

Sandler: It is. Not only is the power system struggling, but the way that we validate it, the way that we measure it, the way that we design it, all of those things are changing too. For another example, I’ve been teaching two-port shunt-through impedance measurements for power integrity for probably 15 years now. Because the currents have gotten so high in AI, we have power buses now down around 20 microohms or so, and we’ve reached the level at which vector network analyzers can’t really make that measurement anymore. And if we could make that measurement, it would be a small-signal measurement in a system that we think is large-signal because it’s nonlinear.

So even simple things like the way we measure impedance are changing. I gave a paper on that for IEEE in May of last year, on the new methods that we’re going to need to measure impedance of power buses as AI increases.
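
To see why a bus in the tens of microohms strains a network analyzer, it helps to look at the idealized two-port shunt-through relationship, in which the device under test sits in shunt between two 50-ohm ports. The sketch below assumes ideal ports and ignores cable and ground-loop error; the function names are hypothetical.

```python
import math

Z0 = 50.0  # reference impedance of each analyzer port, ohms (assumed)


def s21_from_shunt(z_dut, z0=Z0):
    """Ideal S21 (linear) for an impedance z_dut connected in shunt between two z0 ports."""
    return 2 * z_dut / (2 * z_dut + z0)


def shunt_from_s21(s21, z0=Z0):
    """Invert the relation above: recover the shunt impedance from a measured S21."""
    return (z0 / 2) * s21 / (1 - s21)


s21 = s21_from_shunt(20e-6)            # 20-microohm power bus
s21_db = 20 * math.log10(abs(s21))     # transmission level the analyzer must resolve
print(f"S21 = {s21_db:.1f} dB")
print(f"Recovered impedance = {shunt_from_s21(s21) * 1e6:.1f} microohms")
```

For a 20-microohm bus the ideal transmission is roughly -122 dB, so the analyzer has to resolve a signal less than a millionth of the drive amplitude before any practical error source is counted.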

How much of the challenge is at the component level versus the system or rack level?

Sandler: It’s actually happening at almost every level. Let’s start with the printed circuit board. AI is forcing us to rethink laminates and how it is that we make printed circuit boards. I just recently did a paper with Qnity that makes ultra-thin dielectrics, and now they’re making ultra-thin dielectrics with heavyweight copper, specifically targeting AI. So at the level of the printed circuit board and stackups, we’re already thinking about how we’re going to improve power integrity for these higher-power, higher-speed buses.

Then of course decoupling capacitors are being impacted, and even bulk capacitors are being impacted. AVX just introduced three-terminal tantalum capacitors to reduce inductance. Companies like Saras Micro Devices are building embedded-layer ceramics so we can get the capacitors inside the circuit boards. Current monitoring and current shunts need to be higher bandwidth, they need to be smaller. Just about two months ago I got a patent on a new resistor that does exactly that. It allows us to reduce the insertion impedance, extend the bandwidth of current-shunt resistors. Power modules are getting smaller, they’re getting shorter so that they can be on the backside or the front side for vertical power or lateral power. So almost every level of power, all the way from the card level down to the board level, is being impacted.

Then we look beyond the board, to the 50-volt bus. Think about the rack. We have a rack, and the rack has a whole bunch of trays, and each tray is holding one of these cards. As the power is going up, the system power at the rack is also going up. Negative-resistance instabilities become more complex, and Nyquist and minor-loop theory start to come into play. So we’re having trouble even at that level where we’re providing the 54-volt bus. And then if you go a level above that, we’re in the process of changing from AC power to 800-volt DC power, and that’s all for the ability to improve the efficiency at the rack. So everything, all the way from the 800 volts coming into the rack down to the printed circuit board level, is being impacted.
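
The negative-resistance problem Sandler mentions can be illustrated with an ideal constant-power load: a regulated converter draws less current as its input voltage rises, so its incremental input resistance is negative, -V²/P. This is a minimal sketch assuming an ideal constant-power load on a 54-volt bus. The power levels are hypothetical, and a real stability check would compare source and load impedance across frequency.

```python
def incremental_input_resistance(bus_voltage, load_power):
    """Small-signal input resistance dV/dI (ohms) of an ideal constant-power load.

    With I = P / V, dI/dV = -P / V**2, so dV/dI = -V**2 / P.
    """
    return -(bus_voltage ** 2) / load_power


BUS_VOLTAGE = 54.0  # volts, the rack bus voltage mentioned in the interview

# Hypothetical rack power levels (assumptions for illustration only).
for power_kw in (10, 100):
    r_in = incremental_input_resistance(BUS_VOLTAGE, power_kw * 1000.0)
    print(f"{power_kw:>4} kW load: incremental input resistance = {r_in * 1000:.2f} milliohms")
```

As power rises, the magnitude of that negative resistance shrinks, which leaves less room for the impedance of the source feeding the bus.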

It sounds like engineers need to look at this more holistically.

Sandler: Some days I say, man, I wish I was a young engineer right now. They’ve got the greatest tools, they’ve got the greatest things to work on, it’s really awesome. And other days of the week I wake up saying, thank God I’m not a young engineer right now. Incredible pressure. Every day is, let’s figure out how we do the next impossible thing. Engineers today are struggling with so many different problems and so many different technologies that it’s really hard for them to look at the details in every different segment.

So yes, there are people in the AI world that are focused on power integrity, but we’re in an interesting time. We’re in one of those disruptions. I heard Jensen Huang, the CEO of NVIDIA, say that he views it as a reset, and a reset’s a good thing. I agree with him. I’ve said it a little bit differently, and I don’t know who said it more eloquently, but I said right now we are all equal. None of us knows anything. I was the expert. I knew how to do two-port shunt-through impedance better than anybody else could do it. And at the point that two-port shunt-through impedance doesn’t work, I need to come up with a new method. The new method doesn’t exist yet. We’re all equal. So I think we’re all in the same boat trying to figure out what is the path forward.

As a result, what you’re seeing is a lot of people throwing stuff against the wall to see what sticks, doing the best they can with the tools they have, and with the lack of tools in the cases that they don’t. Should they be focused more on power integrity? Probably.

With faster transients and more complex systems, how are engineers approaching measurement and validation?

Sandler: Even this is getting more challenging. The current steps are getting bigger, they’re time-variant, they’re nonlinear, rates are faster, current’s much higher, and that impacts an awful lot of things. One of the things I did a recent article and webinar on was the fact that it’s really difficult for us to even make an accurate measurement of what that transient looks like, because ground bounces are in the way.

So how do we get around that? That’s one of our current challenges. Fortunately Tektronix came along and made a new probe. They actually made it as a current probe, but it’s not. It’s really, for lack of a better terminology, a power-rail probe that has nearly infinite common-mode rejection, so it doesn’t see ground bounce. But even down to the level of how it is that we use an oscilloscope, and the type of probe that we use, impacts the results that we see today. So it’s changing every aspect of everything that we do.

Will engineers know if they have a measurement-validation problem? Is it going to be obvious?

Sandler: It’s a good point. They largely don’t know. In some cases they do. We get a lot of calls here from customers that are struggling to get their measurements and their simulations to agree. And so if their measurements and simulations don’t agree with each other, then yeah, that’s a great big red flag, and they know there’s a problem.

Otherwise, many of them really don’t. I visit AI companies all the time, I’m always in their labs, whether it’s NVIDIA, Qualcomm, whoever, and so I’m able to share what I’ve learned with them. But what I would say is that in the vast majority of labs that I visit, they’re not aware of these limitations, and they’re just making the measurements and reporting what it says.

What’s the biggest design tradeoff engineers are making to maintain power integrity?

Sandler: It’s really hard to pick one. It’s either time or money. We’re in a hurry, so to a large extent we’re making decisions based on what’s easy at hand. What can I reach for and just grab? It’s already here, let’s use that, rather than thinking it from the ground layer up.

And the second problem is money. It’s a little bit mind-boggling to me how much these systems cost, and yet at the same time how much we care about saving a few pennies. But businesses are businesses. I visit customers all the time and it’s like, how come you didn’t use this capacitor? It would’ve been much better for you, you would’ve had much better power integrity. “Yeah, but it costs more.” Well, you get what you pay for. So I think what we miss is the system-level optimization, of which cost is one part, of course, and time is a part, of course. But at the end of the day, performance matters and it has to work.

Not only that, if you look at the pace that we’re scaling at, if we’re not able to create the mathematics and the simulations that assure the optimum performance now, then we don’t have those to allow us to confidently scale to the next level, and then we constantly keep fighting the same battle. Whereas if we take the time now to quantify and model the system so that we can get really good validation and correlation between simulation and measurement, it gives us a lot of leverage in the next step along the way, when it does scale. And of course it’s going to scale. It keeps scaling. Every year it scales.

Does that mean engineers need to consider over-designing to stay safe?

Sandler: I think it’s a question of risk. Which side is better to err on? Years ago, I think it was around 2015, I gave a paper on rogue waves, and I said that one of the risks of having these high-current, very-high-speed systems is that with multiple resonances in the board, it increases the likelihood that we could stack these resonances and create data patterns that would be able to excite them all at the same time, in a phase relationship where they stack on top of each other. That’s why we called them a rogue wave.

It was pretty easy to do in simulation. I was even able to create a lab experiment where I could pretty easily create a rogue wave based on fictitious conditions. But there’s a little bit of a lack of knowledge in what the new chips actually do look like and what the actual workload looks like, especially in a system that’s being asked to do new problems all the time, always resulting in new workloads.

For me, I would probably rather over-perform than under-perform. I think you’d be better off. And what feedback do we have for that? The only feedback we really have is failures. It was quite visible when NVIDIA had their melting connectors and things like that. So that’s pretty visible feedback that says, oops, we missed. Other than that, we don’t really have a lot of data that tells us whether or not we over-performed or under-performed. But we must be hitting it about right, because we’re not hearing about these computers failing all over the place, we’re not hearing about crazy bit error rates, so it must on balance at least be okay.

If you had a crystal ball, how do you see power integrity challenges evolving over the next few years?

Sandler: I have my five-year crystal ball that I pull out every five years to make my predictions, and I’m probably wrong more than I’m right. But I would say that where I’m wrong is that it takes longer than I predicted to come up with solutions, and the problems come up faster than I expected.

What I can tell you is that I visit companies that are in the GPU accelerator business all the time, so I can already see what’s coming in the next generation. In 2024, 2,000-amp core power rails weren’t heard of. We built a 2,000-amp emulator and everybody said, why? Then I started to see 2,000-amp chips in 35-millimeter packages, and then 25-millimeter packages. I just saw a chip coming, 4,000 amps in a 35-millimeter package. I saw one at 7,000 amps in a 50-millimeter package. I saw an 11,000-amp chip. And I even visited Cerebras, and that’s wafer scale, but 32,000 amps. Where’s it going? I think it’s anybody’s guess how high it goes, but I think it just keeps heading towards infinity. And how long does it take to get there? It doesn’t take long. It’s going really fast.

If there’s one thing engineers underestimate about power integrity in high-density systems, what is it?

Sandler: How difficult it is to get it right. And by that I mean there are a lot of smart engineers out there, and I’m in constant contact with many of them. They’re really good at doing simulations. They’re doing EM simulations just like they should, they’re building boards just like they should. And yet there’s this constant request for support. How come my measurements and simulations don’t agree? How come it doesn’t work like I expected it to work?

I think it’s because we’ve made assumptions at the base level. We buy a capacitor, and I don’t care who it’s from, I’m not going to pick on any capacitor companies, but we rely on their data, much of which is, I don’t want to say wrong, because I don’t think that’s actually true. I think the data was acquired in a condition that’s very much unlike the condition we’re using it in. And so we grab this data, and we put it into our system, and we think that it’s going to work right. I just don’t know why we think it’s going to work right. So I think there’s this underlying misunderstanding about the data that we’re getting and what we’re able to do with it.

I think that’s getting better with time, partly because of how many of these companies I’m working with, and we do get them to get the correlation, and we do it by getting better data that does ultimately flow down to the component manufacturers. So I think everybody moves along the escalation train, maybe just not as fast as we’d like, and maybe not everybody at the same pace. But there’s this terrible misunderstanding about how difficult it is to get right, and even how difficult it is to make the measurements.

I used to say that the two-port shunt measurement is the most complex simple measurement you’ll ever make. At surface level it’s really simple. You take two probes, you put them on a board. What could go wrong with that? Well, it turns out that an awful lot can go wrong with that. And I think people are hard on themselves because it turned out to be harder than they thought. The real problem is they just misunderstood what it was going to take to do it.

So engineers should expect more failures at the speed we’re moving?

Sandler: I think you’re right. I lecture a lot, and in one of the lectures recently I talked about the fact that it’s really unfortunate that we’re taught that failure is a bad thing, and so we all try very hard to avoid it. In reality it’s a really great thing. It teaches us what didn’t work. It allows us to incrementally get closer to what does. Eric Bogatin has a famous saying, “Get it right the second time,” because he thinks it’s unrealistic to think that we will get it right the first time, and I think that’s really true.

What I tell my customers is, build early, build often. I don’t even really start with simulation, other than maybe to prove out a topology or an architecture. I can build boards for $5 today. I could build panels of boards for under 20 bucks. So I’ll take a bunch of experiments and I’ll put them on a printed circuit board. I’ll build them, I’ll see what worked, what didn’t, and that’ll get me to the next step really quickly and cheaply. Ultimately I will bring out the simulator, because the simulator is what allows me to optimize it, do the worst-case tolerancing of it, and so on. But to make it work, that’s a whole different thing. Getting your hands dirty and getting a glimpse of what you’re up against early is a really great thing. And you’ve got to know that most of them are going to fail.
