I'm back with Reiner Pope, CEO  of MatX, a new AI chip company. Last time we were talking about  what happens inside a data center. Now I want to understand what  happens inside an AI chip. How does a chip actually work? Full disclosure, by the way: I am an angel investor in MatX. So hopefully you have designed a good chip. Hope so. I'll start with the smallest  fundamental unit of chip design, and we'll build up to what an actual production  chip is and what its components are. At the very bottom level of a chip, the  primitives we work with are logic gates, very simple things like AND, OR, and NOT. These are connected together by wires that have to be laid out physically as metal traces on a chip. The main function that AI chips want to compute is the multiplication of matrices. Inside that, the fundamental primitive is a multiply-accumulate of pairs of numbers. We're going to demonstrate what that calculation looks like by hand, and then infer what  a circuit would look like for that. It'll be easiest if I do a multiply-accumulate of  a four-bit number with another four-bit number. The clearest primitive is  actually multiply-accumulate. So there's a multiply of these two terms, and  then we're going to add in an eight-bit number. Can I ask a clarifying question? Why is this the natural primitive for whatever computation happens inside a computer? There are a few reasons. It's a little bit more efficient, but the  reason it's natural for AI chips is that if you look at what's happening during a matrix  multiply… What is a matrix multiply in short? There's a for-loop over i, over  j, and over k, of output [i, k] += input [i, j] x other input [j, k]. A multiply-accumulate happens at every single step of a matrix multiply. The other observation is that the precision will almost always be higher in the  accumulation step than in the multiplication step. This is specific to AI chips. You're multiplying low-precision numbers, and then when you accumulate, errors accumulate  quickly, so you need more precision there. This is why we've chosen to do a four-bit  multiplication and an eight-bit addition. Let me make sure I understood that. There are two ways to understand that. One is that the value will  be larger than the inputs. The other is that if it was a  floating-point number it would be… Maybe that part is less intuitive to me. But maybe it's the same principle? It really is the same principle. The separate principle is that as you're summing up this number, you're summing  up a whole bunch of numbers, so you've got a lot of rounding errors accumulating. Whereas in this case, there's only one multiplication in the chain,  so there aren't a lot of rounding errors accumulating in the multiplication. Why are you summing up a whole bunch of numbers? There’s just two numbers there. This summation is repeated j many times. Any errors accumulate. I see. So how would we perform this calculation by hand? As a human, we would probably  separate it into two, but we can do it all in one using long multiplication. For the multiplication term first, we're going to multiply this four-bit number by every  single bit position in the other four-bit number. We write that out. First, 1001  multiplied by this bit position. That is the number itself. Then shifted across by one, we're multiplying by 0. That gives us an all-0 number. Shifted across one more to  multiply by this one, we get 1001. Finally, for this last bit position,  we get an all-0 number again. This gives us a bunch of terms that  we have to add for the multiplication. While we're doing that summation, we might as  well add in the actual accumulator term as well. So we just copy that directly across. So this is the sum. It's a five-way sum that we want to compute. What logic gates did it take us  to get to this intermediate step? We needed to produce all 16  of these partial products. How do I produce one of these partial products? Let's take this number 1, for example here. We produce it by multiplying this  number by this one over here. We can produce that with an AND gate. This number is 1 if both this bit is 1 and this bit is 1. If either of them is 0, then the multiplication of 0 times anything is 0. To produce all of this, we ended up consuming 16 AND gates. In the general case, if I were doing a p bit multiply times a q bit  multiply, this will be p times q many ANDs. Finally, I sum them. Most of the work  is going to happen in the summing. Let me describe the other  logic gate that we use here. AND is almost the simplest logic  gate that exists on a chip. It's almost the smallest. At the other extreme,  the very largest logic gate you'll typically use is something called a full adder. Coming from software, you might think that a full adder adds 32-bit numbers together. In this case, it just adds three single-bit numbers together, so you can think  of it as adding 0, 1, and 1 together. When I add these together, the result can  be 0, 1, 2, or 3, so I can express that in binary using just two bits. As input, it has three bits. As output, it has two bits. The number 2 in binary is 10. This is also known as a 3→2 compressor  because it takes three bits of input and produces two bits of output. Just to make sure I understood: the two inputs are an X and a Y value  and then some carry that came in… The three inputs are all bits in the same bit  position, like three bits in a column here. The two outputs, I've drawn them vertically  here and horizontally here to match this vertical versus horizontal layout. This expresses that things in the same column are in the same bit position, whereas  things in adjacent columns are different. This is a carry out, whereas this was the sum. So if the inputs in the full adder were, say, 101, then the output would be 10. If it were 111, it'd be 11. If it were 000, it'd be 00. If it were 010, it'd still be 01. Got it. Yeah. It's essentially just counting the number of things and expressing that in binary. This circuit captures what we as humans naturally do when we're summing along a column. I'll show one iteration of using the full adder to sum. The way I sum here is going to be a little unnatural for humans. We would sum along the column and then remember the carry, but instead of remembering  the carry, we'll explicitly write it out. We proceed from the rightmost  column toward the left. On the rightmost column, we sum the 1 and the 1,  and that produces a zero here and a carry of one. We've used this full adder circuit on this pair  of bits and produced a pair of bits as output. Now we can do the same thing with this column. We have a column of four numbers, so we'll take the first three of them, run a full adder on  them, and that gives us a 0 and a 0 as output. The sum of these is 00. That's the full adder applied to all these bits. As I've used up bits, I'll cross them  out to indicate that I've handled them. Let's keep going a little bit more. I take these three numbers, I add them, and that gives me a 1 and a 0. I've dealt with these three numbers. Now I take these three numbers and add  them, and that gives me a 1 and a 0, and I've dealt with these numbers. The way to view this is that I have this whole grid of numbers that need to be added. I'm going to keep applying full adders to all the bits here, constantly removing three numbers from  a column and writing out two numbers as output. Keep going over and over again until I  eventually get just one single number coming out. This approach is called a Dadda multiplier. This is the standard for how you do area-efficient multipliers using full adders. Let's try to quantify the circuit size of this so we have a sense of how big  things are and can compare them later. How many full adders did I use? How many numbers did I start with? I have the 16 partial products, which is a product  of all of these terms with all of these terms, plus the eight terms that I'm adding here. I started off with 24 bits. Eventually I produced eight bits on the output. In every step, I was crossing off three numbers and writing two numbers out as a result. Every single use of a full adder eliminates one of the bits here. So how many full adders? It must be 24 minus 8, so there  were 16 full adders in this circuit. This is true in the general case as well. There will be p times q many full adders in this circuit. Let me make sure I understand the logic of that. The input bits, 24, is p x q, plus p + q. The output bits are just p + q. So p x q plus p + q, minus p + q equals p x q. That's right. I think this explains, or at least hints at, the second reason  we chose to do a multiply-accumulate. The first reason is that it's what  shows up in matrix multiplication. The second is that it gave us this very  slick, simple p x q, very simple algebra. We've described this whole procedure. Every single atomic step that I took here becomes a logic gate, and then  the wires are connected together. When I had these three inputs that I used to  produce these two outputs, if I think of mapping this to a physical device, there would be a wire  connecting all three of these things together into a logic gate that produced this output. This is the main primitive, at different bit widths, that's inside an AI chip. We're going to build up from here to how you would use it to run all  the other operations you might want. This might be the wrong time to ask, but  whenever Nvidia reports that this chip can do X many FP4 or half as many FP8, it  seems to imply those circuits are fungible, that there's not a dedicated FP4 versus FP8. But the way you're mapping it out here, it seems like if it has to be mapped out in the logic, you  would need a dedicated FP4 multiply-accumulate and then a dedicated FP8 accumulate. Can you "funge" them? As drawn, they're not particularly fungible. This is actually one of the main choices you have to make when designing a chip: how  much of FP4 and how much of FP8 do I have? Sometimes I'll make that consideration from  the point of view of the customer requirement. Another angle is to equalize the  power budget between FP4 and FP8. When they report those numbers and it just happens  to be the case that it does 2x as many FP4 as FP8, they're just choosing to give equivalent  die areas to all the floating points, and as a result it ends up being…? Why is the ratio exactly 2x? Part of it is that surely it won't  be exactly equivalent to die area. There's a data movement reason. We'll maybe come back to this when we look at how it goes into and out of memories. There's something really nice from a software level about the fact that I can pack  two four-bit numbers into the same storage as an eight-bit number. When I store that to memory, the sizing of the buses that I wire out within  the chip makes that work out really nicely. Come to think of it, it's not just 2x. The amount of area it takes sounds like it's quadratic with the bit length. That's why smaller precision is even more favorable than you might think. This is a really big reason. In fact, Nvidia made a change. Historically, up until B100 or B200, every time you halved the bit  precision, you doubled the FLOP count. For the reason you said, because of this quadratic  scaling, that ratio is actually slightly wrong. You should get an even bigger speedup  than you might otherwise think. Nvidia’s product specs have started acknowledging  that in B300 and beyond, where the FP4 is three times faster than the FP8. Though it should be 4x. Yeah. What I've shown here is the  simplest case of integer multiply. When you're dealing with floating point, as  you do in FP4 and FP8, there's this other term, the exponent, that complicates the calculation. What can we see already from this? I think the big observation you've made is that  there's this quadratic scaling with bit width, which is very effective and is  the single reason low-precision arithmetic has worked so well for neural nets. The other thing we're going to do now is compare the area spent on the multiplication  itself with all the circuitry around it. We'll walk back in time a little bit and  see how GPUs prior to Tensor Cores worked, which is in fact the same way CPUs worked. Where do we stick this multiply-accumulate unit? Generically, I'll describe a CUDA core or a CPU. You'll have some register file which stores some number of entries, maybe eight entries of, in this  case, 4-bit numbers, but typically 32-bit numbers. Inside the CUDA core, I'll have  some register file of some depth, and then I'll have my multiply-accumulate circuit. What it's going to do is take three arbitrary registers from this register file,  perform the multiply-accumulate, and then write back to the register file. It's going to write to this one, but it was able to read from this  one, this one, and another random one. It will take three inputs like this. This is the core data path of many processors. Most processors look like this. You've got some set of registers, and then you've got some set of logic units, or ALUs. We want to analyze the cost of the data movement from the register file to the ALU and back. Ultimately, there's going to be some circuit that says, "Well, I don't  always have to select this guy. I might select any of the  registers at any point in time." The first question is: how can I build a circuit? The circuit I'm going to look for is a mux. In this case, it's going to have eight inputs,  one from each entry of the register file, and it's going to have one output,  which is actually producing this output. What is the cost of this thing? All we have to build it out of is AND and OR. How do we build it? We do the dumbest thing possible. We form a mask. When we want to read the  third entry, we're going to AND every single entry with either 1 or 0 based  on whether that's what we want to read, and then we're going to OR all of them together. Just to make sure I understand the basics. What the mux is doing is just selecting an input? Just selecting, invisible to software. You say "I want input number three,"  and that means there's a mux here. So what is the cost of this mux? An n-input mux operating on p bits. I have n rows. That's just eight rows, and each row is p bits wide. I have to AND every single bit, so I get n x p many AND gates. For every single input I have to decide whether I'm going to mask it out or not. Then I'm going to OR them all together. There's going to be n – 1 times p many OR gates. I've got all of these different things, almost all of them are 0s, but I need to collapse  them from my eight options down into one option. Every step, I need to OR one  row into an existing row. It's actually funny that you don't  think at the level of hardware. You just think, "Oh, I'll just select element  three," and something as simple as that is in and of itself quite a complicated circuit. This is the first step of all of the hidden data movement costs that show up. We're just going to compare. I have to pay this cost. I've got one mux here, and in fact I have two more copies of that for each of the  three inputs to my multiply-accumulate operation. I have this cost, which is 3 x n x p AND gates  over here, compared to p x q gates in the actual circuit that is doing the thing I care about. If we plug in actual numbers, with n being eight, I get 24 x p gates just in the data  movement, compared to—if q is four—4 x p gates just in the multiply-adder. Where is the three coming from? Three different inputs here. What I'm  hinting at is that all of this work, which scales as the size of the register  file—and this is a very small register file—all of this work just moving the data from  the register file to the logic unit is many, many times more expensive than the logic unit. In the most recent ClusterMAX report, SemiAnalysis ranked almost 100 different GPU clouds. Crusoe  was one of only five that made the gold tier. SemiAnalysis found that gold-tier providers  like Crusoe had a total cost of ownership that was 5% to 15% lower than silver-tier ones,  even when they had identical GPU pricing. This makes sense because total cost of ownership  is downstream of a bunch of different things that don't necessarily show up in the sticker  price, but that Crusoe has optimized. Things like how well you detect faults  and how quickly you replace failed nodes. For example, Crusoe was one of the first clouds  to adopt NVSentinel, NVIDIA's own GPU monitoring and self-healing software for enhanced  GPU uptime, utilization, and reliability. This lets Crusoe make use of everything that  NVIDIA has learned about why chips fail across all their different fleets and deployments, so that  Crusoe can catch faults earlier in the process. And once they identify a failure, Crusoe can  swap in a healthy node in less than 10 minutes. Because they’re not running bare metal, Crusoe  doesn't have to spend time installing an operating system or configuring drivers. They can just spin up a new VM on an already running and pre-qualified host. If you want to learn more about this or the other reasons that Crusoe made  gold tier, go to crusoe.ai/dwarkesh. It may be helpful to just  see what a mux looks like, maybe a two-bit or a four-bit mux. We’ll do a two-way mux. We've got two different numbers,  we’ve got these two inputs. These are the inputs that are being selected  between, and the selector can either be "I want this one" or "I want the other one." This is a one-hot encoding. This is what we start with. Let's focus on this case. This is the actual input we got, and we  want to produce this guy as the result. Very laboriously, we AND  this bit with all of these. That produces ANDing this bit with this row. Likewise, we AND this bit with this row. That produces all 0s. There are four ANDs here. Finally, we OR these two together, and this gives a 1. We OR these two together, this gives a 1. We OR these two together, it gives a 0. We OR these two together and it gives a 1. Those are the four ORs. This actually ends up looking a little bit like addition. We did exactly the same set of ANDs. We've ANDed all of these things together,  but then instead of collapsing it by using full adder circuits, we just get a  very simple collapsing with OR gates. But that doesn't look like n times p. This was with n=2 inputs. In the general case, we will have n  rows, and we'll have p bits per row. That gives us the n times p many AND gates. In this circuit I've described, almost all of the cost, seven-eighths of the cost, is in reading  and writing the register file, and only a tiny fraction of the cost is in the logic unit itself. This is the problem to solve. This essentially was the state of play prior  to the Volta generation of Nvidia GPUs. This kind of thing is what  was inside the CUDA cores. This problem statement is what motivated  the introduction of Tensor Cores, which are more generically called systolic arrays. Think about how we're going to solve this problem. We're spending almost all of our  circuit area on something that we really don't care about and is hidden to the  software programmer, and the thing that we actually care about is not much of the area. Make this one bigger somehow while keeping this at the same size. That's the goal. The evolution was that we had baked  this much into hardware at this stage. This single line is a multiply-accumulate,  and this single thing was baked into hardware. The idea of a systolic array is to go  two levels of loops up and bake this entire loop out here into hardware. The idea is that if we have a much bigger granularity fixed-function  piece of logic, maybe the taxes we pay on the input and output are much smaller. Interesting. It sounds like you're suggesting that if you go up one step in the matrix  multiply loop, you can tilt the balance more towards compute than communication. That's right. There are two effects we're going to take advantage of here. One is that we can do more stuff per every trip through our register file. The other is that in some of this loop, we can take advantage of certain things staying fixed. Visually, we're going to look at this matrix multiplication. This portion of the loop corresponds to a matrix-vector multiplication. We'll take a matrix and multiply it by a vector. How do we do this? Every column gets multiplied by the vector and then summed. We're going to sum along columns. This 0 and 3 gets multiplied by the 3 and  7 and gets summed, and then the 1 and 2 gets multiplied by the 3 and 7 and gets summed. There is a multiply-accumulate associated with every single one of these entries in the matrix. We'll draw out these four multiply-accumulates. Just to make sure I understand why there are  four multiply-accumulates: each entry in the column that corresponds to the output  vector is a dot product, and in this case it will be two multiplications and then  the addition of those two multiplications. You're accumulating... Really there's only one addition per dot product, but we like to start with zero. But it includes the initialization of zero. Yeah. We want to have quadratically more compute. We have x times y as much  compute as we had before. But we want to aim for having only  x times as much communication. The intention is to get this  advantage term going as y. We've laid down the multiplications. We want to bring in a vector of size two, and that is already in line with our columns target. That's fine. However, we need to manage the communication of this matrix,  which exceeds our budget of x. The idea is that in an AI context, this matrix  is going to stay fixed for a long period of time. We've got some register files sitting over here. The amount of stuff coming out of this register file… this is the term that we  want to go as x, in some sense. We don't want to bring this full matrix  in from the register file every cycle, because that would cost too much in  terms of wiring from the register file. Our key trick is that this matrix can  be stored locally to the systolic array. We'll store these numbers 0, 1, 2, and 3  in a gate called a register that physically stores these numbers, and we're going to  reuse these numbers over and over again for a large number of different vectors. The optimization here is that the nature of matrix multiplication is that you can store this  square quadratic thing directly where the logic is happening, which has an extra dimension compared  to the inputs that you keep swapping in and out. That’s right. This is the nature of what a matrix multiplication is. You do a lot of multiplication to get one value out. A dot product is the result of a lot of multiplications. So that optimization means you can stuff a lot of multiplication in  before you get some value out of it. That's right. Just to complete the  picture of concretely how that looks: I swapped the 3 and the 2 here. Just like this 0 and 3 is going to multiply by the 3 and the 7, we're going  to form a dot product along columns here. We're going to feed a 3 and a 7 in here. This feeds into this multiplication and also feeds into this multiplication. Likewise, the 3 feeds into here and also into here. Then we're going to sum along here. Starting at the top of a column, we feed in 0s,  and then coming out the bottom we get results. Visually, there's a dot product performed along  columns in a matrix, and that maps exactly to what is done spatially in the systolic array. This is one dot product summed vertically, and this is a second dot  product also summed vertically. What is the data that needs to go  into and out of the register file? We have x amount of data coming  out on the output, and we also have x amount of data coming from the input. With respect to the input and output vectors at least, we've met our goal of having only x as  much data going in and out of the register file. This leaves open the question: I said the  weight matrix is stored locally in the systolic array, so how did it get there in the first place? At some point, you need to boot your chip and populate this data, so where did that come from? The trick is that we just do it very slowly. We very slowly trickle-feed  it into the systolic array. The simplest strategy is that we run  this daisy chain: feed a number in here, and on the next clock cycle it will move  down to the next entry of the systolic array. We can do that in every column in parallel,  this is also going to come from here, and that gives us another factor of  approximately x units of bandwidth coming in. Would you mind repeating  that sentence one more time? We know that we're going to be bringing  numbers only rarely into the matrix. We just want to come up with any construction  at all such that the amount of wiring that crosses the boundary of the systolic  array is bounded to x and not go as xy. A particularly simple strategy is that  we bring a number into the top row of the systolic array in one clock cycle. Then, for y consecutive clock cycles, we bring in the top row every time and  shift all the other rows down by one. This keeps the wiring that needs to  come from this expensive register file only down to a factor of x rather than xy. I see. There are two questions in terms of communication: communication  time and communication bandwidth. You're saying that since we're only going to be  loading this in once, let's minimize bandwidth, because bandwidth equals die area. We load it in slowly over smaller lanes because we're just going to  keep this value in there for a while. Exactly. It's interesting to me that when we were talking last time about inference across many chips, the  big high-level thing we're trying to optimize for is increasing the amount of compute per memory  bandwidth, that is to say, per communication. Here also, we're trying to increase  the amount of actual multiplies or additions relative to transporting  information from registers to the logic. In both cases, you're trying to maximize  compute relative to communication. This shows up all the way up and down the stack. This is close to the bottom, to the gates. There's a version that's maybe even closer  to the gates in the precision of the number format that you choose to use. We saw that same effect. There's a squared versus linear term going  on both purely in the precision of the ALU, but also in the size of the matrix. This unit is the next bigger unit. We had the multiplication circuit, and on top  of that we have a pretty large systolic array. I drew it as 2x2, but older TPUs were described  as 128x128 of this circuit shown here. This ends up being the most efficient known  circuit for implementing a matrix multiply. We've talked about how it seems  obvious that you should try to maximize compute relative to communication. What are non-obvious trade-offs that keep you up at night, about whether you should do  X or Y and it's not obvious what the answer is? Most of the decisions in chip  design are sizing decisions. Already in what we've drawn so far… AI  chips all have this circuit in them. They have a systolic array, and somewhere near  it a register file providing inputs and outputs. Even within this scope, the sizing questions you  have are: how big should I make my systolic array, and how big should I make the register file? These two questions are coupled. One way to think of it is to set a  budget for what percentage of chip area you want to spend on data movement. Maybe I say that I want this to be 10% and the systolic array to be 90%. Then I can size my register file. Bigger register files are more flexible. They allow me to get more application-level performance out, but they take away from  the area spent on the systolic array. I recently ran an essay contest where I asked  people to write about what I consider to be some of the biggest open questions about AI. The submission window closed last week, so I used Cursor to create a couple of different  interfaces to help me review the entries. One interface anonymizes submissions  and hides the necessary information. It lets me group responses by question,  add notes, and record my scores. The other interface helps me review  entrants who also want to be considered for the researcher role that I'm hiring for. The UI puts the applicant's essay right next to their resume and their personal website  so that I can see everything at once. Cursor's harness is really good at helping  these models see and improve their UIs. I watched it render these interfaces in  the built-in browser, take screenshots, click through sections, and keep iterating. At this point, Cursor is where I do most of my work. Whether I'm reading and visualizing a bunch of research papers, or coding up an interface to  review applications, or making flashcards for my blackboard lectures, Cursor just makes it very  easy for an AI to look at whatever I'm looking at and help me understand it and work with me on it. So whatever you're working on, you should do it in Cursor. Go to cursor.com/dwarkesh. Where does the clock cycle of a chip come in? What determines what that is? And what is a clock cycle of a chip? At baseline, it's worth observing that chips are incredibly parallel. You've got 100 billion transistors in a chip. A key thing you need to do whenever you  have massive parallelism is synchronize between the different parallel units. In software, typically you have these very expensive synchronization methods like a mutex. One thread will finish what it's doing, grab a lock stored somewhere in memory,  and notify the other thread that it's done. On chips, we take a very different approach. Every nanosecond or so, all circuitry in the chip will pause for a moment and synchronize. That is the clock cycle. The entire chip typically goes in lockstep  to the next operation in one fell swoop. What this looks like in circuitry is that  the clock is mediated by registers, which are these storage devices we've drawn elsewhere. The way to think of it is: I have some storage holding a bit, which might be 0 or 1. Then I have some cloud of logic, which maybe is this systolic array or multiplier. I have a bunch of inputs feeding into this cloud of logic, and eventually there's going to  be some output register that it writes to. There is a global clock signal  driving all these registers. At a certain instance in time, when the clock  strikes, whatever value happens to be on that wire at that instant is what gets stored. The challenge is that I would like to have my clock speed run as fast as possible. If I run at two gigahertz, I get twice as many operations done per second  as if I run at one gigahertz. But what that ends up meaning is that I'm very  sensitive to the delay through this cloud of logic, because any computation happening in there  needs to finish before the next clock cycle hits. A major point of optimization on any chip  is to make this delay as short as possible. Interesting. The constraint here seems  to be that if you add too much logic, you might risk missing the clock cycle. But if you don't add enough, you're leaving potential compute on the table. Is there ever a situation where you take a probabilistic chance that a computation  finishes, or is it strictly that it either finishes by the clock cycle or it doesn't? In standard chip design, you margin it such that there is a probability, but  it's many standard deviations out. For all intents and purposes, it is a  reliable part and will always meet the clock. There are some weird exceptions, like clock domain  crossings where you go from one clock to another. Then you actually do have to  reason about this probability. But in the main path, you margin it such that  you'll get there 25% of the clock cycle in advance, making it very unlikely that it misses. Where the clocks synchronize, where the registers are, is this something  you determine as a chip designer? Or is it an artifact where you want a certain  sequence of logic, and the software you use to convert your Verilog into what you send to  TSMC just determines that to make it work, you have to put a register here, here, and  here, making sure no single step makes the whole chip's clock cycle longer than it has to be? Inserting them is actually a huge part of the work of designing a chip. It's done by a combination of manual and automatic methods. To show the very dumb version of what you can do here, you can take  this logic and split it in half. Instead of just one cloud of logic, I can have two  smaller clouds of logic that do the same thing, but split them up by a register. If you split it in the middle, you can hit twice the clock frequency. That's great, you get twice the performance, but at the cost of an extra  register, which means more storage. Stepping back, why do we need  to synchronize the whole chip? If you imagine playing Factorio or  something, there's no global clock cycle. Things are just done when they're done. There's iron on the plate, and you can take it if you want. Taking that analogy, the thing you need to be mindful of is if I  have two different paths through some logic. Say I have to do computation  f here and computation g here, and they're going to meet for computation h. There's going to be manufacturing variance. In some chips f will take a little longer;  in some chips g will take a little longer. If I have a signal propagating through, and  the results from f and g have to meet up at h, what can go wrong is that f gets  there early and it meets the previous value of g, or the next value of g. Ah. And h needs to know when to start, when the next iteration has… Exactly. This explains why different chips made  at the same process node, the same TSMC technology, can have different clock cycles. Two chips made at 3 nm might have different clock cycles based on whether they were able to optimize  to ensure no single critical path is so long that it slows down the whole chip's clock cycle. That's right. This optimization I showed here is called pipeline register insertion. We've inserted a register in the middle of the pipeline. This is a pure trade-off between clock speed and area. That is the easy case. There is a harder case too. I drew out a pipeline of logic, but in other cases you may have some calculation  which actually feeds back in on itself. It runs some function f and  then writes back to itself. For example, this might be an addition where  you're adding a number every clock cycle. This little circuit essentially sums all the  numbers presented on different clock cycles. The challenge is, if this plus  takes too long, what can I do? If I try to put a pipeline register  right in the middle of it, it changes the computation that's done. Instead of forming a running sum of everything that comes in, I will  actually have two different running sums. I'll end up with a running sum of the even  numbers and a running sum of the odd numbers. This constraint—where I have a loop in my logic,  which all chips have somewhere—is the hardest thing to address and sets the clock cycle. I don't understand why it would be a problem to have that. I'm not even sure what it would mean to have a register there. Is it a sort of atomic operation? Well, plus is not really atomic. As you just demonstrated. It took a whole lot of work to do a summation. You can take the early parts of that work, stick a register in the middle, and  then take the late parts of that work. Okay. TSMC offers a PDK which  specifies the primitives of logic they can grant you in the chip. It's up to them to determine that no primitive is bigger than the clock cycle  they're hoping a process node targets. But other than that, can't you just take  all the primitives from TSMC and keep adding registers between them as much as needed  until you get to your desired clock cycle? As a logic designer, the chip  architect sets the clock cycle. For example, the primitives you get from TSMC  are on the order of AND gates or full adders. It depends a lot on voltage and which library you  choose, but generally you can have about 10, 20, or 30 of these sequentially in a clock cycle. These primitives are very fast, maybe 10 picoseconds. As a logic designer, in principle, if you just had a register and an AND gate in a  loop, you could get an insanely fast clock speed, more than four, five, or six gigahertz. But if you take this really simple circuit and look at the area you're spending  here… This is called one gate equivalent in size, so unit of one in area. This thing is maybe a unit of eight in area. Again, almost all your cost  becomes synchronization or communication cost compared to the actual logic. This would be a case where you've gone too far. You've made your clock speed really  fast at the cost of spending almost all of your area on pipeline registers. Interesting. So you're hinting at a dynamic where you can have a really fast clock speed  but you're not getting that much work done. You can have low latency but low throughput. It hurts your throughput, in fact, because the throughput of your chip is the product of  how much you get done per clock cycle—which is based on area efficiency—times  how many clocks you get per second. This is actually so similar to the thing we  were discussing last time about batch size, where if you have a low batch size, any one  user can receive their next token really fast, but the total number of tokens processed in, say,  an hour will be lower than it could otherwise be. Exactly. You get less parallelism out if  you drive your clock speed up really high. Language models are starting to compete  against the best human forecasters. I sat down with two senior Jane Streeters, Ron  Minsky and Dan Pontecorvo, and asked: at some point, does AI just do what Jane Street does? There's a world that we should take seriously where, you know, we're going to build large  language models or some other AI systems that are like strictly smarter than all humans on the  planet and more capable at all cognitive tasks. Trading in particular feels to me as like kind  of AGI complete, sort of like NP complete, because at the end of the day trading involves  figuring out what things are worth, which means making predictions about the future. Jane Street isn't betting against AI. They just signed a six  billion dollar compute deal. But Ron's view is that the edge keeps moving. I have never been more desperate to hire more engineers and more traders than I am today. You know, you have the usual thing of like the other hard parts that we  don't yet know how to automate. Well, that ends up being where  the competitive edge lies. You can find these open positions and watch  the full interview at janestreet.com/dwarkesh. I remember talking to an FPGA engineer at  Jane Street, Clark, who helped me prep for the previous interview we did together. He was explaining why they use FPGAs. I imagine that for high-frequency trading,  throughput is less important than latency, so having very specific control over the clock cycle in a deterministic way  is the most important thing. Maybe it'd be interesting to talk about why  you can't just achieve that with an ASIC, or why you might use an FPGA to have deterministic  clock cycles for high-frequency trading. Let's consider the business  case for an FPGA versus an ASIC. FPGAs and ASICs use largely  the same conceptual model. You have a series of gates built from small  primitives—ANDs, ORs, XORs—connected together with wires running in a fixed clock cycle. Anything you can express in an FPGA you can express in an ASIC too. It will be about an order of magnitude cheaper and have better energy  efficiency on an ASIC than an FPGA. The trade-off is that the first FPGA costs you  $10,000, whereas the first ASIC you make costs $30 million because it requires an entire tape-out. The business use case for an FPGA is when you want something that has very deterministic  latency, fast runtime, and high parallelism, but you are going to change the  workload frequently, maybe every month. You don't want to pay that  tape-out cost every time. How does an FPGA actually emulate the ASIC  programming model in a fixed piece of hardware? At its core, it has the two  components we just talked about. It has registers as storage  devices, and it has lookup tables (LUTs) which provide all of the gates. Then there's a third component. We have a swarm of these registers and LUTs,  and they are connected by a big set of muxes. In front of every single one of these, we have a  mux which selects an input from everywhere else. We have a whole bunch of different  options feeding into all of these things. What this allows is essentially that when  I program my FPGA, I can take all of these components and superimpose a particular wiring  which goes through this LUT, feed it into another LUT, send it to this register, and then feed  it into another LUT, or something like that. What I've drawn in orange is how you…  FPGA means Field-Programmable Gate Array. The orange is what has been programmed in  the field, whereas the white is all the wires that must exist in the FPGA in order to  actually make the device in the first place. What does it mean to be programmed in the field? Programmed in the field means the device is being deployed in a data center. It's sitting out in the world, and then you can come and program it. Ah, not field as in like electric field. Field as in like out there in the world, ok. If I look at how the field programming comes out of the first lookup table and goes  into a second one, how does that work? Where are the wires that make that happen? I got a little bit lazy in drawing all of these. Every single device here has a mux sitting in  front of it, which can select from all of the nearby circuits that are available. The actual configuration of the FPGA amounts to the mux control. In this mux, we have the data inputs, and we have the control that selects. There's a little storage device sitting next to every single one of these muxes saying, "This  is where you're going to source your input from." Programming it consists of configuring  every single one of these muxes. That makes sense. What is happening  inside of the lookup table? The lookup table is also going to have a  little bit of control telling it what to do. Its purpose is to configurably  take the role of an AND gate, OR gate, XOR, or any of those different things. There are many ways you could consider doing that. The way it's done in traditional  FPGAs… A lookup table has four bits of input and one bit of output. How many different functions are there from four bits to one bit? There are 16 different functions. You can tabulate this as 16 different numbers. You've got a table of 0111001, 16 entries. This table is stored in  this blue configuration bit. It views these four bits as binary, looks up the  relevant row of the table, and emits that bit. This is essentially a truth-table  view of lookup tables. Okay, so if you think about an AND gate, OR gate,  NOR gate, XOR gate, these all take as input… Those are two-input functions. Sometimes we have  a three-input function, like a three-way XOR, or a four-way XOR. In this case, does it just depend on how big it is? The typical size for LUTs is four inputs. It's sort of just a sweet spot. There’s another compute vs. communication trade-off here. If it has too few inputs, you need to use more LUTs. Basically the lookup table is a truth table. With a truth table, you can  program in any gate you want. So instead of a lookup table, you can  just think of it as a programmable gate. That's right. One of the things you can do  here is you can see where the rule of thumb that an FPGA is an order of magnitude  more expensive than an ASIC comes from. You count how many gates would  be inside this lookup table. We can view this lookup table  essentially as one of these muxes. It has to select between 16 different values, so it's a mux with n=16 options and p=1 bits. As we saw earlier, this circuit costs n times p many gates. So it costs np, which is 16, AND gates, and also 16 ORs. This circuit being the mux? Exactly, the mux. The mux that goes into the lookup table? The lookup table itself you can think of  as being a big mux that selects from all 16 rows down to one output. That's the lookup table. But the way you've drawn it here,  there's a mux and then a lookup table. It's muxes all the way down. There is a second mux that is inside here. This mux is this mux. And the other mux is just saying… where it came from in this mess of gates. Right, and the second mux is, "Okay, now you have one value, but that  value is still a four-bit value." Yeah, I've selected four bits from the soup. Then I use those four bits to select which entry in the lookup table I'm going to use. Suppose in the first mux you're pulling from eight nearby registers as input. That's a total of 32 bits going in. Out of that, four bits come out. Those four bits go into the second mux, which is inside the lookup table. In this case, these registers are single-bit registers. If there are eight nearby registers and lookup tables, then I  have eight bits total coming in nearby. I select from eight down to four different values. There are actually four different muxes, a little mux associated with each of these input bits. Each of them is selecting one out of eight. Where are those eight coming from? Nearby registers and other LUTs. Each register is one bit. Yes. I guess AMD or whoever makes these FPGAs  still has to be opinionated about which registers are connected to which registers. You can program in the actual gates, but they add a wire in the connect…  the communication topology, right? You get flexibility at a local grain. There's a nearby neighborhood you can select from, but for more coarse, long-distance  connections, they form an opinion on that. And the reason it's 10x slower is why? If you look at the cost of building this lookup table, it's 32 gates. It can give me the equivalent of—what’s an interesting thing I  can do here—a four-way AND gate. A four-way AND means AND, AND,  and then an AND of an AND. This is a circuit I could implement in  an ASIC directly using three AND gates. Using a LUT, I can also implement it, but  it's going to take 32 gates instead of three. So the overhead is really coming from  the fact that there's a more concise way to describe a truth table than listing out  every single possible combination of inputs, which is just to write out the gate. Yes, to place down the polysilicon and the wires and so on. Interesting. One important point you made to me is that the reason they prefer FPGAs to  CPUs is that they get deterministic clock cycles. They know when a packet will come in and go out. Why isn't that a guarantee in CPUs? You can actually design a CPU that  has deterministic latency as well. In fact, the processors inside a lot of  AI chips also have deterministic latency. Groq has advertised this. TPUs  have that in the core as well. The challenge is getting deterministic  latency and high speed at the same time. Non-deterministic latency comes from  specific design choices in a CPU. It's actually possible to remove those design  choices and make a CPU with deterministic latency, but those are not very attractive in the  market, so people don't make those CPUs anymore. In some sense, deterministic latency is a simpler  starting point, and some chip designers have added things in to make it non-deterministic. To take a concrete example, probably the most important source of non-determinism  on a CPU is the CPU cache itself. In a CPU, you have the CPU die itself,  and then DDR memory off on the side. You have a cache system inside that remembers  recent accesses to DDR and stores them. When I'm running through my  CPU instructions, every time I have an instruction that accesses memory, it  first checks if the data was stored in the cache. If not, it fetches it from DDR. This is a huge optimization. The cache is two orders of  magnitude faster than the DDR. If you never used the cache, basically all  programs would run a hundred times slower. The presence of a cache is absolutely necessary  for a CPU to run at a reasonable speed. But whether or not you get a cache hit  depends on the ambient environment of the CPU: what other programs are running, what has  run recently, and what the random number generator inside the cache system is doing. That is a big source of non-determinism in the runtime of a CPU. That is the memory system for a CPU. The big thing you can do differently  is, instead of having the hardware say, "I’m going to read memory" and then the  hardware decides whether or not it comes from the cache, you can bake this decision  into software, a different design philosophy. You see this in TPUs, for example. I'll draw the same diagram, but I'll call it a scratchpad. The main difference is… This would be a TPU, and you have HBM in this case rather than  DDR, but it's still an off-chip memory. Instead of the software saying "first access  memory" and letting the hardware decide, you have one kind of instruction that goes to  the scratchpad and a totally different kind of instruction that goes to HBM. This style is generically known as scratchpad instead of cache. The key distinction is that you have one kind of instruction that says "read or write  scratchpad," and a totally different instruction that says "read or write HBM." So scratchpad being the cache. Yeah, this thing here is the scratchpad. Stepping way back: people say computers have the "von Neumann architecture", where there's  this serial processing of information. Maybe it's just because we've  been talking about parallel accelerators, but the FPGA is super parallel. The AI accelerators, the TPUs, are super parallel. Even CPUs are super parallel if you  think about all the cores they have. In what sense is modern hardware  actually the von Neumann architecture? Is that actually a fair way  to describe modern hardware? I think it's a fair way to describe CPUs. The amount of parallelism you get on a CPU is about 100 cores times maybe 16-way vector  units, so about 1,000-way parallelism on a CPU. One question: there is a die being used for  the CPU, and if there are fewer threads, just as a matter of transistor voltages  switching on and off, is it that there's literally one control flow—a small part of the  die—where voltages are switching on and off? How do you actually occupy the die area of a CPU… If there are so few cores, what are you spending all of the die on? Yeah, what is happening there? The cores are just much  bigger and more complicated. We should compare a CPU core, which takes  up one one-hundredth of the die, to a LUT. A LUT is only 16 gates. It's clear why there are so many more LUTs in an FPGA than cores in a CPU. But why are there more CUDA cores, for example, than CPU cores? What's the difference between a CPU and a GPU? Inside the CPU, one big use  of the area is the cache. Very little is actually the ALUs. Mostly it's these register files rather than the logic units. Both of those have equivalents in a GPU, so that's not a big difference. But the thing that does not have an equivalent in a GPU is the branch predictor. There is a whole big area in the CPU which is just a bunch of predictors saying when the next  branch will be and where the branch target is. Stripping a lot of that out, as well  as making these register files tighter, drives a lot of the GPU gains over the CPU. What is the purpose of the branch predictor? To execute both branches at  once, or what does it do? The issue is that when I've got a series  of instructions, if I have a branch, the actual step of processing an  instruction takes a really long time. It takes maybe five nanoseconds. The time to notice that I've got a branch, evaluate whether the Boolean is true, update  the program counter to the new target, and then read from the instruction memory  could take five nanoseconds to finish. So in reality, this may  finish somewhere down here. I want to run a clock speed that is much  faster than what five nanoseconds allows. Five nanoseconds is a 200 MHz clock speed. I would like to run at one or two gigahertz. So I need to run other instructions  while the branch is being evaluated. I just want to keep running the  instructions that happen after me. But that might have been wrong. If the branch ended up being taken, then I need to know that instead of evaluating these  instructions, I actually need to jump to wherever the target is and run those instructions instead. The purpose of the branch predictor is to predict, five cycles earlier, that a branch is going to  happen, before you even get to that instruction. If I think about how the brain works  versus what you're describing here, at a high level the differences might be that  while you can do structured sparsity in these accelerators and save yourself some area that you  would have otherwise had to dedicate to gates, in the brain there's unstructured sparsity. Any neuron can connect to any other neuron, and not in ways where they have the column aligned. Then there's the fact that memory and compute are co-located. Although I guess you could say in a way the memory and compute are co-located on these dies too. This is exactly the co-location, in some sense, of the memory and compute. So maybe that isn't a big difference. Another big difference is that the clock cycle  on the brain is much slower than on computers. Partly that's to preserve energy, because the  faster the clock cycle, the bigger the voltage needs to be in order for the signal to settle  and to identify what state a transistor is in. That’s right. I don't know if you have any commentary on what the brain  might be doing versus how these chips work. Let's take the clock speed one first. The clock speed is quite high on a chip because that drives higher throughput. When we compare a GPU running some workload, it's running batch size 1,000. Whereas the brain is not running batch size 1,000, there's only one of me. You could imagine saying, "Take a GPU and instead of running at a gigahertz, run  it at a megahertz," and that would start to look a little more like the equivalent  things you're talking about in the brain. But in the way silicon works, that does not give  you a 1,000x advantage in energy efficiency. What it ends up looking like is you just  run this circuit once to stabilization, and then it will sit idle  for a long period of time. It doesn't consume a lot of  energy while it's sitting idle because most of the energy is consumed  in toggling bits from zero to one and back. Let's talk about the energy  consumption of a circuit like this. The way to think of a bit being stored is that  you've deposited some charge in a capacitor sitting somewhere in the chip implicitly. It becomes charged when the bit becomes a one, and then it becomes discharged  when it next goes to a zero. That cycle of charging the capacitor  and then dumping that charge out to ground is where the energy is consumed. This is called the dynamic or switching power, and it's most of the energy consumption of a chip. There is some other energy consumption just coming from the fact that insulators  aren't perfect, but we'll discard that. Most of the energy consumption comes from  toggling from zero to one and back to zero. If you run a chip much slower and you only  clock it once every thousand clock cycles, you will have 1,000 times fewer transitions. It will be about 1,000 times less energy consumption. But it's not a substantial advantage in energy efficiency. Okay, so you described how a TPU works at a high level. What is the difference at a high level between how a GPU and a TPU work? There is a high-level organization principle that is different, and then inside  the cores things are different. Looking at the high level, we'll  take a GPU and a TPU and see what the top-level block structure looks like. If you think of this as the whole chip in each case, the organization  of the GPU is mostly a bunch of almost-identical units, which are the SMs. They've got an L2 memory in the middle, and then a bunch more of these SMs on the bottom. So there is a fairly regular grid of cores. If we look at a TPU in comparison, you end  up with much coarser-grained units of logic. You end up with just a few matrix units,  which are the big systolic arrays. In the middle you've got some vector unit, and  then you've got your matrix units at the bottom. These matrix units with a vector unit in  the middle make up the whole TPU chip. You can think of scaling this thing  down into a really tiny unit with a smaller matrix unit and a smaller vector  unit, and that is sort of what an SM is. From a very high-level point of view, the GPU has  a lot of tiny TPUs tiled across the whole chip. Oh, interesting. You're suggesting the tensor  core within a streaming SM is analogous to an MXU? Yeah, it's all very similar. I see. If you had more lack of structure, having a bunch of tiny TPUs makes a lot of sense. Whereas if you just have huge matrix multiplications, you might want to avoid the  cost of having individual SMs with their own registers and warp schedulers. Why not just make a huge thing and amortize those costs across the whole thing? This shows up in how large you can grow things. We've seen this theme, especially with the  systolic array, where a larger systolic array amortizes the register file costs better. This design allows you to have larger systolic arrays, whereas the GPU design constrains  you to having small units of everything. There is a trade-off, however. Because of this coarse-grained separation of things, you need to move a lot of  data from the vector unit to the matrix units, through just two lines of perimeter here. If you look at the equivalent thing in a GPU, you've got vector units everywhere, and  you can move data through many different lines. The amount of data you can move between a  vector unit and a matrix unit is actually much higher in a GPU than in a TPU. Instead of having to move all the data through just two lines, you're moving  it through 16 lines of wiring in a GPU. Right. But also you might  have to move across less area. Which is an energy saving as well. So if you can operate entirely within an SM, the data movement is much smaller. But the moment you want to operate across SMs, it becomes more complicated and expensive. So you don’t have to comment, but one might expect that a thing MatX might try to do  is get the GPU-like smaller structure of systolic arrays surrounded by SRAM, but at the  same time make it so that the things you need in an SM to support the CUDA architecture—which  take a bunch of space—you might discard. We've talked publicly about something we  call a splittable systolic array, which in some sense you can think of as big systolic  arrays that can be small systolic arrays too. Cool. Okay, I think that's  a good note to close on. Reiner, thank you so much. Thanks, Dwarkesh.