I'm back with Reiner Pope, CEO 
of MatX, a new AI chip company.
Last time we were talking about 
what happens inside a data center.
Now I want to understand what 
happens inside an AI chip.
How does a chip actually work?
Full disclosure, by the way:
I am an angel investor in MatX.
So hopefully you have designed a good chip.
Hope so. I'll start with the smallest 
fundamental unit of chip design, and we'll
build up to what an actual production 
chip is and what its components are.
At the very bottom level of a chip, the 
primitives we work with are logic gates,
very simple things like AND, OR, and NOT.
These are connected together by wires that have to
be laid out physically as metal traces on a chip.
The main function that AI chips want to compute
is the multiplication of matrices.
Inside that, the fundamental primitive
is a multiply-accumulate of pairs of numbers.
We're going to demonstrate what that calculation
looks like by hand, and then infer what 
a circuit would look like for that.
It'll be easiest if I do a multiply-accumulate of 
a four-bit number with another four-bit number.
The clearest primitive is 
actually multiply-accumulate.
So there's a multiply of these two terms, and 
then we're going to add in an eight-bit number.
Can I ask a clarifying question?
Why is this the natural primitive for
whatever computation happens inside a computer?
There are a few reasons.
It's a little bit more efficient, but the 
reason it's natural for AI chips is that if
you look at what's happening during a matrix 
multiply… What is a matrix multiply in short?
There's a for-loop over i, over 
j, and over k, of output [i,
k] += input [i, j] x other input [j, k].
A multiply-accumulate happens at every
single step of a matrix multiply.
The other observation is that the
precision will almost always be higher in the 
accumulation step than in the multiplication step.
This is specific to AI chips.
You're multiplying low-precision numbers,
and then when you accumulate, errors accumulate 
quickly, so you need more precision there.
This is why we've chosen to do a four-bit 
multiplication and an eight-bit addition.
Let me make sure I understood that.
There are two ways to understand that.
One is that the value will 
be larger than the inputs.
The other is that if it was a 
floating-point number it would be…
Maybe that part is less intuitive to me.
But maybe it's the same principle?
It really is the same principle.
The separate principle is that as
you're summing up this number, you're summing 
up a whole bunch of numbers, so you've got a
lot of rounding errors accumulating.
Whereas in this case, there's only
one multiplication in the chain, 
so there aren't a lot of rounding
errors accumulating in the multiplication.
Why are you summing up a whole bunch of numbers?
There’s just two numbers there.
This summation is repeated j many times.
Any errors accumulate. I see.
So how would we perform this calculation by hand?
As a human, we would probably 
separate it into two, but we can do
it all in one using long multiplication.
For the multiplication term first, we're
going to multiply this four-bit number by every 
single bit position in the other four-bit number.
We write that out. First, 1001 
multiplied by this bit position.
That is the number itself.
Then shifted across by one,
we're multiplying by 0.
That gives us an all-0 number.
Shifted across one more to 
multiply by this one, we get 1001.
Finally, for this last bit position, 
we get an all-0 number again.
This gives us a bunch of terms that 
we have to add for the multiplication.
While we're doing that summation, we might as 
well add in the actual accumulator term as well.
So we just copy that directly across.
So this is the sum.
It's a five-way sum that we want to compute.
What logic gates did it take us 
to get to this intermediate step?
We needed to produce all 16 
of these partial products.
How do I produce one of these partial products?
Let's take this number 1, for example here.
We produce it by multiplying this 
number by this one over here.
We can produce that with an AND gate.
This number is 1 if both this bit
is 1 and this bit is 1.
If either of them is 0,
then the multiplication of 0 times anything is 0.
To produce all of this, we ended up
consuming 16 AND gates.
In the general case,
if I were doing a p bit multiply times a q bit 
multiply, this will be p times q many ANDs.
Finally, I sum them. Most of the work 
is going to happen in the summing.
Let me describe the other 
logic gate that we use here.
AND is almost the simplest logic 
gate that exists on a chip.
It's almost the smallest. At the other extreme, 
the very largest logic gate you'll typically use
is something called a full adder.
Coming from software, you might think
that a full adder adds 32-bit numbers together.
In this case, it just adds three single-bit
numbers together, so you can think 
of it as adding 0, 1, and 1 together.
When I add these together, the result can 
be 0, 1, 2, or 3, so I can express that
in binary using just two bits.
As input, it has three bits.
As output, it has two bits.
The number 2 in binary is 10.
This is also known as a 3→2 compressor 
because it takes three bits of input
and produces two bits of output.
Just to make sure I understood: the
two inputs are an X and a Y value 
and then some carry that came in…
The three inputs are all bits in the same bit 
position, like three bits in a column here.
The two outputs, I've drawn them vertically 
here and horizontally here to match this
vertical versus horizontal layout.
This expresses that things in the same
column are in the same bit position, whereas 
things in adjacent columns are different.
This is a carry out, whereas this was the sum.
So if the inputs in the full adder were, say,
101, then the output would be 10.
If it were 111, it'd be 11.
If it were 000, it'd be 00.
If it were 010, it'd still be 01.
Got it.
Yeah. It's essentially just counting the
number of things and expressing that in binary.
This circuit captures what we as humans naturally
do when we're summing along a column.
I'll show one iteration of using
the full adder to sum.
The way I sum here is going
to be a little unnatural for humans.
We would sum along the column and then
remember the carry, but instead of remembering 
the carry, we'll explicitly write it out.
We proceed from the rightmost 
column toward the left.
On the rightmost column, we sum the 1 and the 1, 
and that produces a zero here and a carry of one.
We've used this full adder circuit on this pair 
of bits and produced a pair of bits as output.
Now we can do the same thing with this column.
We have a column of four numbers, so we'll take
the first three of them, run a full adder on 
them, and that gives us a 0 and a 0 as output.
The sum of these is 00.
That's the full adder applied to all these bits.
As I've used up bits, I'll cross them 
out to indicate that I've handled them.
Let's keep going a little bit more.
I take these three numbers, I add
them, and that gives me a 1 and a 0.
I've dealt with these three numbers.
Now I take these three numbers and add 
them, and that gives me a 1 and a 0,
and I've dealt with these numbers.
The way to view this is that I have
this whole grid of numbers that need to be added.
I'm going to keep applying full adders to all the
bits here, constantly removing three numbers from 
a column and writing out two numbers as output.
Keep going over and over again until I 
eventually get just one single number coming out.
This approach is called a Dadda multiplier.
This is the standard for how you do
area-efficient multipliers using full adders.
Let's try to quantify the circuit size of
this so we have a sense of how big 
things are and can compare them later.
How many full adders did I use?
How many numbers did I start with?
I have the 16 partial products, which is a product 
of all of these terms with all of these terms,
plus the eight terms that I'm adding here.
I started off with 24 bits.
Eventually I produced eight bits on the output.
In every step, I was crossing off three numbers
and writing two numbers out as a result.
Every single use of a full adder eliminates
one of the bits here.
So how many full adders?
It must be 24 minus 8, so there 
were 16 full adders in this circuit.
This is true in the general case as well.
There will be p times q many
full adders in this circuit.
Let me make sure I understand the logic of that.
The input bits, 24, is p x q, plus p + q.
The output bits are just p + q.
So p x q plus p + q, minus p + q equals p x q.
That's right. I think this explains,
or at least hints at, the second reason 
we chose to do a multiply-accumulate.
The first reason is that it's what 
shows up in matrix multiplication.
The second is that it gave us this very 
slick, simple p x q, very simple algebra.
We've described this whole procedure.
Every single atomic step that I took
here becomes a logic gate, and then 
the wires are connected together.
When I had these three inputs that I used to 
produce these two outputs, if I think of mapping
this to a physical device, there would be a wire 
connecting all three of these things together
into a logic gate that produced this output.
This is the main primitive, at different bit
widths, that's inside an AI chip.
We're going to build up from here
to how you would use it to run all 
the other operations you might want.
This might be the wrong time to ask, but 
whenever Nvidia reports that this chip
can do X many FP4 or half as many FP8, it 
seems to imply those circuits are fungible,
that there's not a dedicated FP4 versus FP8.
But the way you're mapping it out here, it seems
like if it has to be mapped out in the logic, you 
would need a dedicated FP4 multiply-accumulate and
then a dedicated FP8 accumulate.
Can you "funge" them?
As drawn, they're not particularly fungible.
This is actually one of the main choices you
have to make when designing a chip: how 
much of FP4 and how much of FP8 do I have?
Sometimes I'll make that consideration from 
the point of view of the customer requirement.
Another angle is to equalize the 
power budget between FP4 and FP8.
When they report those numbers and it just happens 
to be the case that it does 2x as many FP4 as FP8,
they're just choosing to give equivalent 
die areas to all the floating points,
and as a result it ends up being…?
Why is the ratio exactly 2x?
Part of it is that surely it won't 
be exactly equivalent to die area.
There's a data movement reason.
We'll maybe come back to this when we
look at how it goes into and out of memories.
There's something really nice from a software
level about the fact that I can pack 
two four-bit numbers into the same
storage as an eight-bit number.
When I store that to memory,
the sizing of the buses that I wire out within 
the chip makes that work out really nicely.
Come to think of it, it's not just 2x.
The amount of area it takes sounds like
it's quadratic with the bit length.
That's why smaller precision is even
more favorable than you might think.
This is a really big reason.
In fact, Nvidia made a change.
Historically, up until B100 or
B200, every time you halved the bit 
precision, you doubled the FLOP count.
For the reason you said, because of this quadratic 
scaling, that ratio is actually slightly wrong.
You should get an even bigger speedup 
than you might otherwise think.
Nvidia’s product specs have started acknowledging 
that in B300 and beyond, where the FP4 is three
times faster than the FP8.
Though it should be 4x.
Yeah. What I've shown here is the 
simplest case of integer multiply.
When you're dealing with floating point, as 
you do in FP4 and FP8, there's this other term,
the exponent, that complicates the calculation.
What can we see already from this?
I think the big observation you've made is that 
there's this quadratic scaling with bit width,
which is very effective and is 
the single reason low-precision
arithmetic has worked so well for neural nets.
The other thing we're going to do now is compare
the area spent on the multiplication 
itself with all the circuitry around it.
We'll walk back in time a little bit and 
see how GPUs prior to Tensor Cores worked,
which is in fact the same way CPUs worked.
Where do we stick this multiply-accumulate unit?
Generically, I'll describe a CUDA core or a CPU.
You'll have some register file which stores some
number of entries, maybe eight entries of, in this 
case, 4-bit numbers, but typically 32-bit numbers.
Inside the CUDA core, I'll have 
some register file of some depth,
and then I'll have my multiply-accumulate circuit.
What it's going to do is take three arbitrary
registers from this register file, 
perform the multiply-accumulate,
and then write back to the register file.
It's going to write to this one,
but it was able to read from this 
one, this one, and another random one.
It will take three inputs like this.
This is the core data path of many processors.
Most processors look like this.
You've got some set of registers, and
then you've got some set of logic units, or ALUs.
We want to analyze the cost of the data movement
from the register file to the ALU and back.
Ultimately, there's going to be some
circuit that says, "Well, I don't 
always have to select this guy.
I might select any of the 
registers at any point in time."
The first question is: how can I build a circuit?
The circuit I'm going to look for is a mux.
In this case, it's going to have eight inputs, 
one from each entry of the register file,
and it's going to have one output, 
which is actually producing this output.
What is the cost of this thing?
All we have to build it out of is AND and OR.
How do we build it?
We do the dumbest thing possible.
We form a mask. When we want to read the 
third entry, we're going to AND every
single entry with either 1 or 0 based 
on whether that's what we want to read,
and then we're going to OR all of them together.
Just to make sure I understand the basics.
What the mux is doing is just selecting an input?
Just selecting, invisible to software.
You say "I want input number three," 
and that means there's a mux here.
So what is the cost of this mux?
An n-input
mux operating on p bits.
I have n rows. That's just eight
rows, and each row is p bits wide.
I have to AND every single bit,
so I get n x p many AND gates.
For every single input I have to decide
whether I'm going to mask it out or not.
Then I'm going to OR them all together.
There's going to be n – 1 times p many OR gates.
I've got all of these different things,
almost all of them are 0s, but I need to collapse 
them from my eight options down into one option.
Every step, I need to OR one 
row into an existing row.
It's actually funny that you don't 
think at the level of hardware.
You just think, "Oh, I'll just select element 
three," and something as simple as that is
in and of itself quite a complicated circuit.
This is the first step of all of the hidden
data movement costs that show up.
We're just going to compare.
I have to pay this cost.
I've got one mux here, and in fact
I have two more copies of that for each of the 
three inputs to my multiply-accumulate operation.
I have this cost, which is 3 x n x p AND gates 
over here, compared to p x q gates in the actual
circuit that is doing the thing I care about.
If we plug in actual numbers, with n being
eight, I get 24 x p gates just in the data 
movement, compared to—if q is four—4 x p
gates just in the multiply-adder.
Where is the three coming from?
Three different inputs here. What I'm 
hinting at is that all of this work,
which scales as the size of the register 
file—and this is a very small register
file—all of this work just moving the data from 
the register file to the logic unit is many,
many times more expensive than the logic unit.
In the most recent ClusterMAX report, SemiAnalysis
ranked almost 100 different GPU clouds. Crusoe 
was one of only five that made the gold tier.
SemiAnalysis found that gold-tier providers 
like Crusoe had a total cost of ownership that
was 5% to 15% lower than silver-tier ones, 
even when they had identical GPU pricing.
This makes sense because total cost of ownership 
is downstream of a bunch of different things that
don't necessarily show up in the sticker 
price, but that Crusoe has optimized.
Things like how well you detect faults 
and how quickly you replace failed nodes.
For example, Crusoe was one of the first clouds 
to adopt NVSentinel, NVIDIA's own GPU monitoring
and self-healing software for enhanced 
GPU uptime, utilization, and reliability.
This lets Crusoe make use of everything that 
NVIDIA has learned about why chips fail across all
their different fleets and deployments, so that 
Crusoe can catch faults earlier in the process.
And once they identify a failure, Crusoe can 
swap in a healthy node in less than 10 minutes.
Because they’re not running bare metal, Crusoe 
doesn't have to spend time installing an
operating system or configuring drivers.
They can just spin up a new VM on an
already running and pre-qualified host.
If you want to learn more about this or
the other reasons that Crusoe made 
gold tier, go to crusoe.ai/dwarkesh.
It may be helpful to just 
see what a mux looks like,
maybe a two-bit or a four-bit mux.
We’ll do a two-way mux.
We've got two different numbers, 
we’ve got these two inputs.
These are the inputs that are being selected 
between, and the selector can either be "I want
this one" or "I want the other one."
This is a one-hot encoding.
This is what we start with.
Let's focus on this case.
This is the actual input we got, and we 
want to produce this guy as the result.
Very laboriously, we AND 
this bit with all of these.
That produces ANDing this bit with this row.
Likewise, we AND this bit with this row.
That produces all 0s. There are four ANDs here.
Finally,
we OR these two together, and this gives a 1.
We OR these two together, this gives a 1.
We OR these two together, it gives a 0.
We OR these two together and it gives a 1.
Those are the four ORs.
This actually ends up
looking a little bit like addition.
We did exactly the same set of ANDs.
We've ANDed all of these things together, 
but then instead of collapsing it by using
full adder circuits, we just get a 
very simple collapsing with OR gates.
But that doesn't look like n times p.
This was with n=2 inputs.
In the general case, we will have n 
rows, and we'll have p bits per row.
That gives us the n times p many AND gates.
In this circuit I've described, almost all of
the cost, seven-eighths of the cost, is in reading 
and writing the register file, and only a tiny
fraction of the cost is in the logic unit itself.
This is the problem to solve.
This essentially was the state of play prior 
to the Volta generation of Nvidia GPUs.
This kind of thing is what 
was inside the CUDA cores.
This problem statement is what motivated 
the introduction of Tensor Cores,
which are more generically called systolic arrays.
Think about how we're going to solve this problem.
We're spending almost all of our 
circuit area on something that we
really don't care about and is hidden to the 
software programmer, and the thing that we
actually care about is not much of the area.
Make this one bigger somehow while keeping
this at the same size.
That's the goal.
The evolution was that we had baked 
this much into hardware at this stage.
This single line is a multiply-accumulate, 
and this single thing was baked into hardware.
The idea of a systolic array is to go 
two levels of loops up and bake this
entire loop out here into hardware.
The idea is that if we have a much
bigger granularity fixed-function 
piece of logic, maybe the taxes we
pay on the input and output are much smaller.
Interesting. It sounds like you're suggesting
that if you go up one step in the matrix 
multiply loop, you can tilt the balance
more towards compute than communication.
That's right. There are two effects we're
going to take advantage of here.
One is that we can do more stuff
per every trip through our register file.
The other is that in some of this loop, we can
take advantage of certain things staying fixed.
Visually, we're going to look at this
matrix multiplication.
This portion of the loop
corresponds to a matrix-vector multiplication.
We'll take a matrix and multiply it by a vector.
How do we do this?
Every column gets
multiplied by the vector and then summed.
We're going to sum along columns.
This 0 and 3 gets multiplied by the 3 and 
7 and gets summed, and then the 1 and 2
gets multiplied by the 3 and 7 and gets summed.
There is a multiply-accumulate associated with
every single one of these entries in the matrix.
We'll draw out these four multiply-accumulates.
Just to make sure I understand why there are 
four multiply-accumulates: each entry in the
column that corresponds to the output 
vector is a dot product, and in this
case it will be two multiplications and then 
the addition of those two multiplications.
You're accumulating...
Really there's only one addition per
dot product, but we like to start with zero.
But it includes the initialization of zero.
Yeah. We want to have quadratically more compute.
We have x times y as much 
compute as we had before.
But we want to aim for having only 
x times as much communication.
The intention is to get this 
advantage term going as y.
We've laid down the multiplications.
We want to bring in a vector of size two, and
that is already in line with our columns target.
That's fine. However, we need to manage the
communication of this matrix, 
which exceeds our budget of x.
The idea is that in an AI context, this matrix 
is going to stay fixed for a long period of time.
We've got some register files sitting over here.
The amount of stuff coming out of this register
file… this is the term that we 
want to go as x, in some sense.
We don't want to bring this full matrix 
in from the register file every cycle,
because that would cost too much in 
terms of wiring from the register file.
Our key trick is that this matrix can 
be stored locally to the systolic array.
We'll store these numbers 0, 1, 2, and 3 
in a gate called a register that physically
stores these numbers, and we're going to 
reuse these numbers over and over again
for a large number of different vectors.
The optimization here is that the nature of
matrix multiplication is that you can store this 
square quadratic thing directly where the logic is
happening, which has an extra dimension compared 
to the inputs that you keep swapping in and out.
That’s right.
This is the nature of
what a matrix multiplication is.
You do a lot of multiplication
to get one value out.
A dot product is the
result of a lot of multiplications.
So that optimization means you can
stuff a lot of multiplication in 
before you get some value out of it.
That's right. Just to complete the 
picture of concretely how that looks:
I swapped the 3 and the 2 here.
Just like this 0 and 3 is going
to multiply by the 3 and the 7, we're going 
to form a dot product along columns here.
We're going to feed a 3 and a 7 in here.
This feeds into this multiplication and also
feeds into this multiplication.
Likewise, the 3 feeds
into here and also into here.
Then we're going to sum along here.
Starting at the top of a column, we feed in 0s, 
and then coming out the bottom we get results.
Visually, there's a dot product performed along 
columns in a matrix, and that maps exactly to
what is done spatially in the systolic array.
This is one dot product summed vertically,
and this is a second dot 
product also summed vertically.
What is the data that needs to go 
into and out of the register file?
We have x amount of data coming 
out on the output, and we also
have x amount of data coming from the input.
With respect to the input and output vectors
at least, we've met our goal of having only x as 
much data going in and out of the register file.
This leaves open the question: I said the 
weight matrix is stored locally in the systolic
array, so how did it get there in the first place?
At some point, you need to boot your chip and
populate this data, so where did that come from?
The trick is that we just do it very slowly.
We very slowly trickle-feed 
it into the systolic array.
The simplest strategy is that we run 
this daisy chain: feed a number in here,
and on the next clock cycle it will move 
down to the next entry of the systolic array.
We can do that in every column in parallel, 
this is also going to come from here,
and that gives us another factor of 
approximately x units of bandwidth coming in.
Would you mind repeating 
that sentence one more time?
We know that we're going to be bringing 
numbers only rarely into the matrix.
We just want to come up with any construction 
at all such that the amount of wiring that
crosses the boundary of the systolic 
array is bounded to x and not go as xy.
A particularly simple strategy is that 
we bring a number into the top row of
the systolic array in one clock cycle.
Then, for y consecutive clock cycles,
we bring in the top row every time and 
shift all the other rows down by one.
This keeps the wiring that needs to 
come from this expensive register file
only down to a factor of x rather than xy.
I see. There are two questions in terms
of communication: communication 
time and communication bandwidth.
You're saying that since we're only going to be 
loading this in once, let's minimize bandwidth,
because bandwidth equals die area.
We load it in slowly over smaller
lanes because we're just going to 
keep this value in there for a while.
Exactly.
It's interesting to me that when we were talking
last time about inference across many chips, the 
big high-level thing we're trying to optimize for
is increasing the amount of compute per memory 
bandwidth, that is to say, per communication.
Here also, we're trying to increase 
the amount of actual multiplies
or additions relative to transporting 
information from registers to the logic.
In both cases, you're trying to maximize 
compute relative to communication.
This shows up all the way up and down the stack.
This is close to the bottom, to the gates.
There's a version that's maybe even closer 
to the gates in the precision of the number
format that you choose to use.
We saw that same effect.
There's a squared versus linear term going 
on both purely in the precision of the ALU,
but also in the size of the matrix.
This unit is the next bigger unit.
We had the multiplication circuit, and on top 
of that we have a pretty large systolic array.
I drew it as 2x2, but older TPUs were described 
as 128x128 of this circuit shown here.
This ends up being the most efficient known 
circuit for implementing a matrix multiply.
We've talked about how it seems 
obvious that you should try to
maximize compute relative to communication.
What are non-obvious trade-offs that
keep you up at night, about whether you should do 
X or Y and it's not obvious what the answer is?
Most of the decisions in chip 
design are sizing decisions.
Already in what we've drawn so far… AI 
chips all have this circuit in them.
They have a systolic array, and somewhere near 
it a register file providing inputs and outputs.
Even within this scope, the sizing questions you 
have are: how big should I make my systolic array,
and how big should I make the register file?
These two questions are coupled.
One way to think of it is to set a 
budget for what percentage of chip
area you want to spend on data movement.
Maybe I say that I want this to be 10%
and the systolic array to be 90%.
Then I can size my register file.
Bigger register files are more flexible.
They allow me to get more application-level
performance out, but they take away from 
the area spent on the systolic array.
I recently ran an essay contest where I asked 
people to write about what I consider to be some
of the biggest open questions about AI.
The submission window closed last week,
so I used Cursor to create a couple of different 
interfaces to help me review the entries.
One interface anonymizes submissions 
and hides the necessary information.
It lets me group responses by question, 
add notes, and record my scores.
The other interface helps me review 
entrants who also want to be considered
for the researcher role that I'm hiring for.
The UI puts the applicant's essay right next
to their resume and their personal website 
so that I can see everything at once.
Cursor's harness is really good at helping 
these models see and improve their UIs.
I watched it render these interfaces in 
the built-in browser, take screenshots,
click through sections, and keep iterating.
At this point,
Cursor is where I do most of my work.
Whether I'm reading and visualizing a bunch
of research papers, or coding up an interface to 
review applications, or making flashcards for my
blackboard lectures, Cursor just makes it very 
easy for an AI to look at whatever I'm looking at
and help me understand it and work with me on it.
So whatever you're working on,
you should do it in Cursor.
Go to cursor.com/dwarkesh.
Where does the clock cycle of a chip come in?
What determines what that is?
And what is a clock cycle of a chip?
At baseline, it's worth observing
that chips are incredibly parallel.
You've got 100 billion transistors in a chip.
A key thing you need to do whenever you 
have massive parallelism is synchronize
between the different parallel units.
In software, typically you have these very
expensive synchronization methods like a mutex.
One thread will finish what it's doing,
grab a lock stored somewhere in memory, 
and notify the other thread that it's done.
On chips, we take a very different approach.
Every nanosecond or so, all circuitry in the chip
will pause for a moment and synchronize.
That is the clock cycle.
The entire chip typically goes in lockstep 
to the next operation in one fell swoop.
What this looks like in circuitry is that 
the clock is mediated by registers, which
are these storage devices we've drawn elsewhere.
The way to think of it is: I have some storage
holding a bit, which might be 0 or 1.
Then I have some cloud of logic,
which maybe is this systolic array or multiplier.
I have a bunch of inputs feeding into this cloud
of logic, and eventually there's going to 
be some output register that it writes to.
There is a global clock signal 
driving all these registers.
At a certain instance in time, when the clock 
strikes, whatever value happens to be on that
wire at that instant is what gets stored.
The challenge is that I would like to have
my clock speed run as fast as possible.
If I run at two gigahertz, I get twice as
many operations done per second 
as if I run at one gigahertz.
But what that ends up meaning is that I'm very 
sensitive to the delay through this cloud of
logic, because any computation happening in there 
needs to finish before the next clock cycle hits.
A major point of optimization on any chip 
is to make this delay as short as possible.
Interesting. The constraint here seems 
to be that if you add too much logic,
you might risk missing the clock cycle.
But if you don't add enough, you're
leaving potential compute on the table.
Is there ever a situation where you take
a probabilistic chance that a computation 
finishes, or is it strictly that it either
finishes by the clock cycle or it doesn't?
In standard chip design, you margin it such
that there is a probability, but 
it's many standard deviations out.
For all intents and purposes, it is a 
reliable part and will always meet the clock.
There are some weird exceptions, like clock domain 
crossings where you go from one clock to another.
Then you actually do have to 
reason about this probability.
But in the main path, you margin it such that 
you'll get there 25% of the clock cycle in
advance, making it very unlikely that it misses.
Where the clocks synchronize, where the
registers are, is this something 
you determine as a chip designer?
Or is it an artifact where you want a certain 
sequence of logic, and the software you use to
convert your Verilog into what you send to 
TSMC just determines that to make it work,
you have to put a register here, here, and 
here, making sure no single step makes the
whole chip's clock cycle longer than it has to be?
Inserting them is actually a huge part of the
work of designing a chip.
It's done by a combination
of manual and automatic methods.
To show the very dumb version of
what you can do here, you can take 
this logic and split it in half.
Instead of just one cloud of logic, I can have two 
smaller clouds of logic that do the same thing,
but split them up by a register.
If you split it in the middle,
you can hit twice the clock frequency.
That's great, you get twice the performance,
but at the cost of an extra 
register, which means more storage.
Stepping back, why do we need 
to synchronize the whole chip?
If you imagine playing Factorio or 
something, there's no global clock cycle.
Things are just done when they're done.
There's iron on the plate, and you can
take it if you want.
Taking that analogy,
the thing you need to be mindful of is if I 
have two different paths through some logic.
Say I have to do computation 
f here and computation g here,
and they're going to meet for computation h.
There's going to be manufacturing variance.
In some chips f will take a little longer; 
in some chips g will take a little longer.
If I have a signal propagating through, and 
the results from f and g have to meet up at h,
what can go wrong is that f gets 
there early and it meets the previous
value of g, or the next value of g.
Ah. And h needs to know when to start,
when the next iteration has…
Exactly.
This explains why different chips made 
at the same process node, the same TSMC
technology, can have different clock cycles.
Two chips made at 3 nm might have different clock
cycles based on whether they were able to optimize 
to ensure no single critical path is so long that
it slows down the whole chip's clock cycle.
That's right. This optimization I showed here
is called pipeline register insertion.
We've inserted a register in the
middle of the pipeline.
This is a pure trade-off
between clock speed and area.
That is the easy case.
There is a harder case too.
I drew out a pipeline of logic,
but in other cases you may have some calculation 
which actually feeds back in on itself.
It runs some function f and 
then writes back to itself.
For example, this might be an addition where 
you're adding a number every clock cycle.
This little circuit essentially sums all the 
numbers presented on different clock cycles.
The challenge is, if this plus 
takes too long, what can I do?
If I try to put a pipeline register 
right in the middle of it, it
changes the computation that's done.
Instead of forming a running sum
of everything that comes in, I will 
actually have two different running sums.
I'll end up with a running sum of the even 
numbers and a running sum of the odd numbers.
This constraint—where I have a loop in my logic, 
which all chips have somewhere—is the hardest
thing to address and sets the clock cycle.
I don't understand why it would be
a problem to have that.
I'm not even sure what it
would mean to have a register there.
Is it a sort of atomic operation?
Well, plus is not really atomic.
As you just demonstrated.
It took a whole lot of work to do a summation.
You can take the early parts of that work,
stick a register in the middle, and 
then take the late parts of that work.
Okay. TSMC offers a PDK which 
specifies the primitives of
logic they can grant you in the chip.
It's up to them to determine that no
primitive is bigger than the clock cycle 
they're hoping a process node targets.
But other than that, can't you just take 
all the primitives from TSMC and keep adding
registers between them as much as needed 
until you get to your desired clock cycle?
As a logic designer, the chip 
architect sets the clock cycle.
For example, the primitives you get from TSMC 
are on the order of AND gates or full adders.
It depends a lot on voltage and which library you 
choose, but generally you can have about 10, 20,
or 30 of these sequentially in a clock cycle.
These primitives are very fast,
maybe 10 picoseconds.
As a logic designer, in principle,
if you just had a register and an AND gate in a 
loop, you could get an insanely fast clock speed,
more than four, five, or six gigahertz.
But if you take this really simple circuit
and look at the area you're spending 
here… This is called one gate equivalent
in size, so unit of one in area.
This thing is maybe a unit of eight in area.
Again, almost all your cost 
becomes synchronization or
communication cost compared to the actual logic.
This would be a case where you've gone too far.
You've made your clock speed really 
fast at the cost of spending almost
all of your area on pipeline registers.
Interesting. So you're hinting at a dynamic
where you can have a really fast clock speed 
but you're not getting that much work done.
You can have low latency but low throughput.
It hurts your throughput, in fact, because
the throughput of your chip is the product of 
how much you get done per clock cycle—which
is based on area efficiency—times 
how many clocks you get per second.
This is actually so similar to the thing we 
were discussing last time about batch size,
where if you have a low batch size, any one 
user can receive their next token really fast,
but the total number of tokens processed in, say, 
an hour will be lower than it could otherwise be.
Exactly. You get less parallelism out if 
you drive your clock speed up really high.
Language models are starting to compete 
against the best human forecasters.
I sat down with two senior Jane Streeters, Ron 
Minsky and Dan Pontecorvo, and asked: at some
point, does AI just do what Jane Street does?
There's a world that we should take seriously
where, you know, we're going to build large 
language models or some other AI systems that
are like strictly smarter than all humans on the 
planet and more capable at all cognitive tasks.
Trading in particular feels to me as like kind 
of AGI complete, sort of like NP complete,
because at the end of the day trading involves 
figuring out what things are worth, which means
making predictions about the future.
Jane Street isn't betting against AI.
They just signed a six 
billion dollar compute deal.
But Ron's view is that the edge keeps moving.
I have never been more desperate to hire more
engineers and more traders than I am today.
You know, you have the usual thing of like
the other hard parts that we 
don't yet know how to automate.
Well, that ends up being where 
the competitive edge lies.
You can find these open positions and watch 
the full interview at janestreet.com/dwarkesh.
I remember talking to an FPGA engineer at 
Jane Street, Clark, who helped me prep for
the previous interview we did together.
He was explaining why they use FPGAs.
I imagine that for high-frequency trading, 
throughput is less important than latency,
so having very specific control over the clock
cycle in a deterministic way 
is the most important thing.
Maybe it'd be interesting to talk about why 
you can't just achieve that with an ASIC,
or why you might use an FPGA to have deterministic 
clock cycles for high-frequency trading.
Let's consider the business 
case for an FPGA versus an ASIC.
FPGAs and ASICs use largely 
the same conceptual model.
You have a series of gates built from small 
primitives—ANDs, ORs, XORs—connected together
with wires running in a fixed clock cycle.
Anything you can express in an FPGA you
can express in an ASIC too.
It will be about an order of
magnitude cheaper and have better energy 
efficiency on an ASIC than an FPGA.
The trade-off is that the first FPGA costs you 
$10,000, whereas the first ASIC you make costs $30
million because it requires an entire tape-out.
The business use case for an FPGA is when you want
something that has very deterministic 
latency, fast runtime, and high parallelism,
but you are going to change the 
workload frequently, maybe every month.
You don't want to pay that 
tape-out cost every time.
How does an FPGA actually emulate the ASIC 
programming model in a fixed piece of hardware?
At its core, it has the two 
components we just talked about.
It has registers as storage 
devices, and it has lookup tables
(LUTs) which provide all of the gates.
Then there's a third component.
We have a swarm of these registers and LUTs, 
and they are connected by a big set of muxes.
In front of every single one of these, we have a 
mux which selects an input from everywhere else.
We have a whole bunch of different 
options feeding into all of these things.
What this allows is essentially that when 
I program my FPGA, I can take all of these
components and superimpose a particular wiring 
which goes through this LUT, feed it into another
LUT, send it to this register, and then feed 
it into another LUT, or something like that.
What I've drawn in orange is how you… 
FPGA means Field-Programmable Gate Array.
The orange is what has been programmed in 
the field, whereas the white is all the
wires that must exist in the FPGA in order to 
actually make the device in the first place.
What does it mean to be programmed in the field?
Programmed in the field means the device is
being deployed in a data center.
It's sitting out in the world,
and then you can come and program it.
Ah, not field as in like electric field.
Field as in like out there in the world, ok.
If I look at how the field programming comes
out of the first lookup table and goes 
into a second one, how does that work?
Where are the wires that make that happen?
I got a little bit lazy in drawing all of these.
Every single device here has a mux sitting in 
front of it, which can select from all of the
nearby circuits that are available.
The actual configuration of the
FPGA amounts to the mux control.
In this mux, we have the data inputs,
and we have the control that selects.
There's a little storage device sitting next
to every single one of these muxes saying, "This 
is where you're going to source your input from."
Programming it consists of configuring 
every single one of these muxes.
That makes sense. What is happening 
inside of the lookup table?
The lookup table is also going to have a 
little bit of control telling it what to do.
Its purpose is to configurably 
take the role of an AND gate,
OR gate, XOR, or any of those different things.
There are many ways you could consider doing that.
The way it's done in traditional 
FPGAs… A lookup table has four bits
of input and one bit of output.
How many different functions are
there from four bits to one bit?
There are 16 different functions.
You can tabulate this as 16 different numbers.
You've got a table of 0111001, 16 entries.
This table is stored in 
this blue configuration bit.
It views these four bits as binary, looks up the 
relevant row of the table, and emits that bit.
This is essentially a truth-table 
view of lookup tables.
Okay, so if you think about an AND gate, OR gate, 
NOR gate, XOR gate, these all take as input…
Those are two-input functions. Sometimes we have 
a three-input function, like a three-way XOR,
or a four-way XOR.
In this case,
does it just depend on how big it is?
The typical size for LUTs is four inputs.
It's sort of just a sweet spot.
There’s another compute vs. communication
trade-off here.
If it has too
few inputs, you need to use more LUTs.
Basically the lookup table is a truth table.
With a truth table, you can 
program in any gate you want.
So instead of a lookup table, you can 
just think of it as a programmable gate.
That's right. One of the things you can do 
here is you can see where the rule of thumb
that an FPGA is an order of magnitude 
more expensive than an ASIC comes from.
You count how many gates would 
be inside this lookup table.
We can view this lookup table 
essentially as one of these muxes.
It has to select between 16 different values,
so it's a mux with n=16 options and p=1 bits.
As we saw earlier, this circuit
costs n times p many gates.
So it costs np, which is 16,
AND gates, and also 16 ORs.
This circuit being the mux?
Exactly, the mux.
The mux that goes into the lookup table?
The lookup table itself you can think of 
as being a big mux that selects from all 16
rows down to one output.
That's the lookup table.
But the way you've drawn it here, 
there's a mux and then a lookup table.
It's muxes all the way down.
There is a second mux that is inside here.
This mux is this mux.
And the other mux is just saying…
where it came from in this mess of gates.
Right, and the second mux is, "Okay,
now you have one value, but that 
value is still a four-bit value."
Yeah, I've selected four bits from the soup.
Then I use those four bits to select which
entry in the lookup table I'm going to use.
Suppose in the first mux you're pulling
from eight nearby registers as input.
That's a total of 32 bits going in.
Out of that, four bits come out.
Those four bits go into the second mux,
which is inside the lookup table.
In this case, these registers
are single-bit registers.
If there are eight nearby
registers and lookup tables, then I 
have eight bits total coming in nearby.
I select from eight down to four different values.
There are actually four different muxes, a little
mux associated with each of these input bits.
Each of them is selecting one out of eight.
Where are those eight coming from?
Nearby registers and other LUTs.
Each register is one bit.
Yes.
I guess AMD or whoever makes these FPGAs 
still has to be opinionated about which
registers are connected to which registers.
You can program in the actual gates,
but they add a wire in the connect… 
the communication topology, right?
You get flexibility at a local grain.
There's a nearby neighborhood you can select from,
but for more coarse, long-distance 
connections, they form an opinion on that.
And the reason it's 10x slower is why?
If you look at the cost of building this
lookup table, it's 32 gates.
It can give me the equivalent
of—what’s an interesting thing I 
can do here—a four-way AND gate.
A four-way AND means AND, AND, 
and then an AND of an AND.
This is a circuit I could implement in 
an ASIC directly using three AND gates.
Using a LUT, I can also implement it, but 
it's going to take 32 gates instead of three.
So the overhead is really coming from 
the fact that there's a more concise
way to describe a truth table than listing out 
every single possible combination of inputs,
which is just to write out the gate.
Yes, to place down the polysilicon
and the wires and so on.
Interesting. One important point you
made to me is that the reason they prefer FPGAs to 
CPUs is that they get deterministic clock cycles.
They know when a packet will come in and go out.
Why isn't that a guarantee in CPUs?
You can actually design a CPU that 
has deterministic latency as well.
In fact, the processors inside a lot of 
AI chips also have deterministic latency.
Groq has advertised this. TPUs 
have that in the core as well.
The challenge is getting deterministic 
latency and high speed at the same time.
Non-deterministic latency comes from 
specific design choices in a CPU.
It's actually possible to remove those design 
choices and make a CPU with deterministic latency,
but those are not very attractive in the 
market, so people don't make those CPUs anymore.
In some sense, deterministic latency is a simpler 
starting point, and some chip designers have
added things in to make it non-deterministic.
To take a concrete example, probably the most
important source of non-determinism 
on a CPU is the CPU cache itself.
In a CPU, you have the CPU die itself, 
and then DDR memory off on the side.
You have a cache system inside that remembers 
recent accesses to DDR and stores them.
When I'm running through my 
CPU instructions, every time
I have an instruction that accesses memory, it 
first checks if the data was stored in the cache.
If not, it fetches it from DDR.
This is a huge optimization.
The cache is two orders of 
magnitude faster than the DDR.
If you never used the cache, basically all 
programs would run a hundred times slower.
The presence of a cache is absolutely necessary 
for a CPU to run at a reasonable speed.
But whether or not you get a cache hit 
depends on the ambient environment of the CPU:
what other programs are running, what has 
run recently, and what the random number
generator inside the cache system is doing.
That is a big source of non-determinism
in the runtime of a CPU.
That is the memory system for a CPU.
The big thing you can do differently 
is, instead of having the hardware say,
"I’m going to read memory" and then the 
hardware decides whether or not it comes
from the cache, you can bake this decision 
into software, a different design philosophy.
You see this in TPUs, for example.
I'll draw the same diagram,
but I'll call it a scratchpad.
The main difference is… This would be a TPU,
and you have HBM in this case rather than 
DDR, but it's still an off-chip memory.
Instead of the software saying "first access 
memory" and letting the hardware decide,
you have one kind of instruction that goes to 
the scratchpad and a totally different kind
of instruction that goes to HBM.
This style is generically known
as scratchpad instead of cache.
The key distinction is that you have
one kind of instruction that says "read or write 
scratchpad," and a totally different instruction
that says "read or write HBM."
So scratchpad being the cache.
Yeah, this thing here is the scratchpad.
Stepping way back: people say computers have the
"von Neumann architecture", where there's 
this serial processing of information.
Maybe it's just because we've 
been talking about parallel
accelerators, but the FPGA is super parallel.
The AI accelerators, the TPUs, are super parallel.
Even CPUs are super parallel if you 
think about all the cores they have.
In what sense is modern hardware 
actually the von Neumann architecture?
Is that actually a fair way 
to describe modern hardware?
I think it's a fair way to describe CPUs.
The amount of parallelism you get on a CPU is
about 100 cores times maybe 16-way vector 
units, so about 1,000-way parallelism on a CPU.
One question: there is a die being used for 
the CPU, and if there are fewer threads,
just as a matter of transistor voltages 
switching on and off, is it that there's
literally one control flow—a small part of the 
die—where voltages are switching on and off?
How do you actually occupy the die area of a CPU…
If there are so few cores, what are
you spending all of the die on?
Yeah, what is happening there?
The cores are just much 
bigger and more complicated.
We should compare a CPU core, which takes 
up one one-hundredth of the die, to a LUT.
A LUT is only 16 gates.
It's clear why there are so many
more LUTs in an FPGA than cores in a CPU.
But why are there more CUDA cores,
for example, than CPU cores?
What's the difference between a CPU and a GPU?
Inside the CPU, one big use 
of the area is the cache.
Very little is actually the ALUs.
Mostly it's these register
files rather than the logic units.
Both of those have equivalents in a GPU,
so that's not a big difference.
But the thing that does not have
an equivalent in a GPU is the branch predictor.
There is a whole big area in the CPU which is
just a bunch of predictors saying when the next 
branch will be and where the branch target is.
Stripping a lot of that out, as well 
as making these register files tighter,
drives a lot of the GPU gains over the CPU.
What is the purpose of the branch predictor?
To execute both branches at 
once, or what does it do?
The issue is that when I've got a series 
of instructions, if I have a branch,
the actual step of processing an 
instruction takes a really long time.
It takes maybe five nanoseconds.
The time to notice that I've got a branch,
evaluate whether the Boolean is true, update 
the program counter to the new target,
and then read from the instruction memory 
could take five nanoseconds to finish.
So in reality, this may 
finish somewhere down here.
I want to run a clock speed that is much 
faster than what five nanoseconds allows.
Five nanoseconds is a 200 MHz clock speed.
I would like to run at one or two gigahertz.
So I need to run other instructions 
while the branch is being evaluated.
I just want to keep running the 
instructions that happen after me.
But that might have been wrong.
If the branch ended up being taken, then
I need to know that instead of evaluating these 
instructions, I actually need to jump to wherever
the target is and run those instructions instead.
The purpose of the branch predictor is to predict,
five cycles earlier, that a branch is going to 
happen, before you even get to that instruction.
If I think about how the brain works 
versus what you're describing here,
at a high level the differences might be that 
while you can do structured sparsity in these
accelerators and save yourself some area that you 
would have otherwise had to dedicate to gates,
in the brain there's unstructured sparsity.
Any neuron can connect to any other neuron, and
not in ways where they have the column aligned.
Then there's the fact that memory
and compute are co-located.
Although I guess you could say in a way the memory
and compute are co-located on these dies too.
This is exactly the co-location,
in some sense, of the memory and compute.
So maybe that isn't a big difference.
Another big difference is that the clock cycle 
on the brain is much slower than on computers.
Partly that's to preserve energy, because the 
faster the clock cycle, the bigger the voltage
needs to be in order for the signal to settle 
and to identify what state a transistor is in.
That’s right.
I don't know if
you have any commentary on what the brain 
might be doing versus how these chips work.
Let's take the clock speed one first.
The clock speed is quite high on a
chip because that drives higher throughput.
When we compare a GPU running some workload,
it's running batch size 1,000.
Whereas the brain is not running
batch size 1,000, there's only one of me.
You could imagine saying, "Take a GPU and
instead of running at a gigahertz, run 
it at a megahertz," and that would start
to look a little more like the equivalent 
things you're talking about in the brain.
But in the way silicon works, that does not give 
you a 1,000x advantage in energy efficiency.
What it ends up looking like is you just 
run this circuit once to stabilization,
and then it will sit idle 
for a long period of time.
It doesn't consume a lot of 
energy while it's sitting
idle because most of the energy is consumed 
in toggling bits from zero to one and back.
Let's talk about the energy 
consumption of a circuit like this.
The way to think of a bit being stored is that 
you've deposited some charge in a capacitor
sitting somewhere in the chip implicitly.
It becomes charged when the bit becomes a one,
and then it becomes discharged 
when it next goes to a zero.
That cycle of charging the capacitor 
and then dumping that charge out
to ground is where the energy is consumed.
This is called the dynamic or switching power,
and it's most of the energy consumption of a chip.
There is some other energy consumption just
coming from the fact that insulators 
aren't perfect, but we'll discard that.
Most of the energy consumption comes from 
toggling from zero to one and back to zero.
If you run a chip much slower and you only 
clock it once every thousand clock cycles,
you will have 1,000 times fewer transitions.
It will be about 1,000 times
less energy consumption.
But it's not a substantial
advantage in energy efficiency.
Okay, so you described how a TPU
works at a high level.
What is the difference
at a high level between how a GPU and a TPU work?
There is a high-level organization principle that
is different, and then inside 
the cores things are different.
Looking at the high level, we'll 
take a GPU and a TPU and see what
the top-level block structure looks like.
If you think of this as the whole chip
in each case, the organization 
of the GPU is mostly a bunch of
almost-identical units, which are the SMs.
They've got an L2 memory in the middle,
and then a bunch more of these SMs on the bottom.
So there is a fairly regular grid of cores.
If we look at a TPU in comparison, you end 
up with much coarser-grained units of logic.
You end up with just a few matrix units, 
which are the big systolic arrays.
In the middle you've got some vector unit, and 
then you've got your matrix units at the bottom.
These matrix units with a vector unit in 
the middle make up the whole TPU chip.
You can think of scaling this thing 
down into a really tiny unit with a
smaller matrix unit and a smaller vector 
unit, and that is sort of what an SM is.
From a very high-level point of view, the GPU has 
a lot of tiny TPUs tiled across the whole chip.
Oh, interesting. You're suggesting the tensor 
core within a streaming SM is analogous to an MXU?
Yeah, it's all very similar.
I see. If you had more lack of structure,
having a bunch of tiny TPUs makes a lot of sense.
Whereas if you just have huge matrix
multiplications, you might want to avoid the 
cost of having individual SMs with their own
registers and warp schedulers.
Why not just make a huge thing
and amortize those costs across the whole thing?
This shows up in how large you can grow things.
We've seen this theme, especially with the 
systolic array, where a larger systolic
array amortizes the register file costs better.
This design allows you to have larger systolic
arrays, whereas the GPU design constrains 
you to having small units of everything.
There is a trade-off, however.
Because of this coarse-grained
separation of things, you need to move a lot of 
data from the vector unit to the matrix units,
through just two lines of perimeter here.
If you look at the equivalent thing
in a GPU, you've got vector units everywhere, and 
you can move data through many different lines.
The amount of data you can move between a 
vector unit and a matrix unit is actually
much higher in a GPU than in a TPU.
Instead of having to move all the data
through just two lines, you're moving 
it through 16 lines of wiring in a GPU.
Right. But also you might 
have to move across less area.
Which is an energy saving as well.
So if you can operate entirely within
an SM, the data movement is much smaller.
But the moment you want to operate across SMs,
it becomes more complicated and expensive.
So you don’t have to comment, but one might
expect that a thing MatX might try to do 
is get the GPU-like smaller structure of
systolic arrays surrounded by SRAM, but at the 
same time make it so that the things you need in
an SM to support the CUDA architecture—which 
take a bunch of space—you might discard.
We've talked publicly about something we 
call a splittable systolic array, which
in some sense you can think of as big systolic 
arrays that can be small systolic arrays too.
Cool. Okay, I think that's 
a good note to close on.
Reiner, thank you so much.
Thanks, Dwarkesh.