Today, I'm here with Eric Jang, who was  most recently vice president of AI at 1X Technologies, and before that, senior research  scientist at what is now Google DeepMind Robotics. You've been on sabbatical for the last few months. One of the things you've been doing is rebuilding, improving, and hacking on AlphaGo. Today, you're going to explain building AlphaGo from scratch and what it tells us about  the future of AI research and development. Before we get to that, why is AlphaGo interesting? Why is this the project you decided to do on sabbatical rather than just  hanging out at the beach? I like making things, and AlphaGo and Go AI is one  of those things that really got me into the field. When I saw the early breakthroughs on  AlphaGo in 2014, 2015, 2016 and so forth, it was profound to see how smart AI systems  could become and the computational complexity class they could tackle with deep learning. This is a problem that has long been understood to be intractable for search, and yet  it was solved through deep learning. That was quite mysterious to  me, and I've always wanted to understand that phenomenon a little better. My training is in deep neural nets for robotics, where the decisions made by the neural  networks are a bit more intuitive. But AlphaGo is a problem where the decisions  are the result of a very, very deep search. It's always been very mysterious to me how a  ten-layer network can amortize the simulation of something so deep in the game tree. If you plot out how much compute it took to build various iterations of strong Go  bots over the years, you can see that in 2020 there was an open-source project called  KataGo by David Wu from Jane Street, which achieved a 40x reduction in the compute needed  to train a really strong Go bot tabula rasa. I'm not certain if it's stronger than AlphaGo  Zero, AlphaZero, or MuZero, but it's very strong, and this is what most Go practitioners today  train against when they're playing an AI. Thanks to LLM coding, what took a whole team of  research scientists at DeepMind and millions of dollars of research and compute can now be done  for a few thousand dollars of rented compute. We should first discuss how Go  works. How does the game work? Go is a very simple game that can be  implemented quickly and easily on a computer. The objective is to put down  black and white stones and try to occupy as much territory as possible. I might start by putting down a black stone. Black always goes first. Go ahead. The way  you capture an opponent's stones is that for every intersection, if you can surround  all four of its neighbors with your stones, then it's cut off from oxygen, if  you will, and it's a dead stone. Now I control these four stones as  well as this empty intersection here. There are slight variations between Chinese,  Japanese, and what are called Tromp-Taylor rules. Tromp-Taylor rules are designed to  be completely unambiguous, so this is what all Go AIs train and resolve against. In typical Go, when humans play, you're actually not allowed to put this white stone down here. It  would be instant suicide. In Tromp-Taylor, it's actually fine. You put it down, and it immediately  resolves to death, so the outcome is the same. Let's start over and play a few stones, and then  I'll explain some more. I'll just start there. I'm basically playing randomly here, but  I'm trying to get around your stones and see if I can surround them. This move exposes one empty neighbor for your white stone. It's akin to a check in chess. 
If you don't respond immediately by putting one here, then I can immediately capture this. I see. Because it's the diagonals that determine whether you're surrounded or not. The orthogonal neighbors, not the diagonals. This one is surrounded on three sides, so you're at risk of losing that stone if you don't play one immediately there. Now you can see that I'm starting to pressure you, because by putting a stone here, you're forced to put one here. Otherwise, you would have this two-block to yourself. Yes. And if you think through what happens if you were to respond here, you can probably search into the future and deduce what I'll do in response once you do that. You have a lot of confidence in my abilities, but I'm guessing you'd put the black here. That's right, and then I would capture all three of these stones. So I should just assume that this little block is gone. Yes. In Go, it's actually okay to let an opponent capture some stones if, for example, it lets you position to capture more stones somewhere else on the board. This is what makes Go a beautiful game: you can lose the battle but win the war. As the board size increases, the complexity of these micro versus macro dynamics gets more interesting. Presumably you'd put one here. So now I would capture this entire group, and this would be mine. There's one more case I want to demonstrate, which I actually had a bug in my code for recently. Let's consider a formation like this, with other pieces on the board in play. Let's talk about how the game ends. In this territory, who controls these areas? Is it white or black? White. It's actually black, because I've surrounded this whole area. Assuming I have other black stones here, it's very hard for you to break this out of the control of these stones. And when the final score is tallied, would these ones also count as being in... Great question. This is where different rule sets have different ways of scoring. We should talk about how you resolve scores between humans and how you resolve scores in computer Go, because there's some ambiguity in how humans evaluate this. Most humans would look at this board configuration and conclude that black has totally surrounded white, and white has no chance of life. We could play out more here, but at the end I would capture everything. However, if you have a way of breaking this formation and connecting white to something outside of it, then it can flip. This is where it's a little bit hard for a computer to decide these kinds of things. How do humans do it? It's worth thinking about how humans resolve this, because this will map later to how we think about the deep neural network. Humans basically say, "I think the game is done," and then you have to also say, "I think the game is done." Then we'll say, "I think these are my stones," and you have to agree. If you don't agree, we keep playing. Essentially, once two humans—their so-called value functions—agree on a consensus, then the Chinese rules resolve that. In Tromp-Taylor scoring, it's perfectly unambiguous, so it can be decided algorithmically by a computer. If you have this at the endgame, the way you score it is that you first count how many stones you control, and that's unambiguous. Then you count how many empty intersections are not touched by your opponent's stones. These intersections would not count for either player, because all of these intersections are connected to both white stones and black stones.
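To make that rule concrete, here is a minimal sketch of Tromp-Taylor-style area scoring, done as a flood fill over empty regions. The names are illustrative, not code from any real Go engine:

```python
# A minimal sketch of Tromp-Taylor-style area scoring, assuming board[r][c]
# holds 'B', 'W', or '.' for empty. Illustrative names, not from any engine.
def tromp_taylor_score(board):
    size = len(board)
    seen = set()
    score = {'B': 0, 'W': 0}

    # Every stone on the board counts as a point for its owner.
    for r in range(size):
        for c in range(size):
            if board[r][c] in score:
                score[board[r][c]] += 1

    # Flood-fill each empty region. It scores for a color only if every stone
    # it touches is that one color; touching both colors scores for neither.
    for r in range(size):
        for c in range(size):
            if board[r][c] != '.' or (r, c) in seen:
                continue
            region, borders, stack = [], set(), [(r, c)]
            seen.add((r, c))
            while stack:
                cr, cc = stack.pop()
                region.append((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if 0 <= nr < size and 0 <= nc < size:
                        if board[nr][nc] == '.' and (nr, nc) not in seen:
                            seen.add((nr, nc))
                            stack.append((nr, nc))
                        elif board[nr][nc] in score:
                            borders.add(board[nr][nc])
            if borders == {'B'}:
                score['B'] += len(region)
            elif borders == {'W'}:
                score['W'] += len(region)

    return score  # e.g. {'B': 44, 'W': 37}
```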
If this were like this, then  white would get three points. This is a little odd because a human would know  that white is actually losing these points. But Tromp-Taylor scoring would consider white to  have all of these points as well as these points. So that is a very big difference in how computer  Go scores things and how humans score things. How does the game end? The game ends when either a player chooses to resign or both players  pass consecutively. Those are the rules. Now help me crack this with AI. Let's understand how AlphaGo actually works and how somebody in the  audience might be able to implement it. Let's start with an intuition about the  underlying search process used to make moves, and we'll layer on ideas from deep learning  to make it much more efficient and tractable. Go is a game with just two players. We're going to draw a person here, and we're going to draw an AI here. Let's say this person is playing black, so they go first. They go here. Now the AI is  going to make a move based on what it sees here. There's a question of how you  encode these inputs into the AI. Maybe you could use ones and zeros, but you  want to represent black, white, and empty. You would need at least three different values. Maybe you could use zeros, ones, and twos. The AI might see something like  zero, zero, zero, zero, one. This is the input to the AI on its turn. The AI  can choose. Let's just pick three possible random moves it can make, and I just drew these  at random. Which move is best here? Well, we don't know until the game ends. Go doesn't have any kind of local reward for which move here is good. This is what makes Go a very difficult game. You don't actually know who won until you really  get to the end. How deep is this tree? On a 19x19 Go board, there are roughly on the order of  361 moves on any given move, and of course, as it fills up, you have fewer moves. The number of steps in the game can be somewhere from 250 to 300 moves. Experts might decide to end the game well before that, but under Tromp-Taylor scoring, you actually  have to play things all the way to the end. So this could be 300 moves,  a depth of 300 in the tree. If you keep on expanding possible moves—here  the AI goes, then here the human would go, and so forth—you find that  you end up with an enormous explosion in the possible game outcomes  originating from just this one state. This is something on the order of 361³⁰⁰, which is  far more than the number of atoms in the universe. Of course, there are redundancies and symmetries,  so it's not actually that, but if you were to do a naive tree with no merging of children,  you end up with a tree about this big. What do you mean by "merging of children"? Let me use this board here. If we start here, and then you play here,  and then I play here, and then you play here, that's equivalent to me starting here, you playing  here, me playing here, and then you playing here. Both arrived at the same spot,  but through different paths. So this child node can be  thought of as a shared ancestor. Got it. And I guess it starts at 361  but it decreases by one each time. Yes, the branching factor  decreases by one each time. In any case, this is a very large tree. This is also why computer scientists for many years thought that Go was not  a tractable problem this century. The amount of compute you  would need to exhaustively search every possibility is just too large. If you could, Go is actually a deterministic game. 
In any given state, you can compute  the best possible strategy you can make in order to win the game. You can search all the possible futures where you win, and then just make  sure you always stay in that set of futures. AlphaGo's core conceptual breakthrough was using  neural nets to make this search problem tractable. Before we get into how neural networks  are involved, let's talk about how we can, assuming we had a powerful enough computer,  search this tree to find the best move. In the beginning, you're not  going to build out the whole tree, because storing it would be very expensive. Instead, you might interactively figure out which leaves of this tree are worthy  of exploring and expanding into the future to see what else is there. There are some early algorithms in the bandit literature like UCB1, which is not  exactly appropriate for a sequential game like Go, but very much inspired the action  selection algorithm used in AlphaGo. UCB1 looks like: on every move, we're going  to take the best action—the argmax over a that maximizes Q of a, and I'll explain what Q of a is  in a moment—plus some sort of exploration bonus. On every node, we're going  to track a few quantities. Let's consider each of these a node. This is the root node where you're making decisions from, and these  are the children of the root node. Each node is basically a data structure that  stores a visit count of this child node. How often the parent visited this node. Yes. One thing that's easy to trip on if you come from robotics or other kinds of reinforcement  learning is, where are the actions? I'm only talking about nodes. Nodes here represent states,  and because this is a perfectly deterministic game with no randomness, you actually can  just infer the action based on the child. If I go here, that implies an action,  and this is the state we resolve to. If you ask LLMs to vibe-code an MCTS  implementation, they will most likely design the right data structure here. But it's sort of a chef's choice. You can rewrite the tree  structure however you like. This is what Claude 4.6 wrote for me when I  asked it, and it was a very reasonable choice. Qₐ represents the mean  action value of this action. I'll use a subscript a to denote that  this corresponds to taking a specific action to get here from the root node. Taking a gets us to this node here. We're also going to store the  probability of taking this action Again, from the parent? From the parent, yes. What are the odds that we sample this one? This will become relevant later.  We've talked about a deterministic tree for now, so I'll bring probabilities into this later. Finally, we have a dictionary of children, which is just more of these nodes in a  classic linked list-style reference tree. This is the basic data  structure to implement a tree. In AlphaGo, they use a slightly different  action-selection criterion called PUCT, short for Predicted Upper Confidence with Trees. When you select which child to take, you do argmax a of Q(s,a) plus a constant. The equation forms are pretty similar. These are both scoring criteria. You want to argmax this  quantity and you want to argmax this quantity to determine which action to take. Let's break down the intuition of how you select actions here. Q(s,a) is the mean action value, so how good a given child is on average. If you actually knew the whole tree, this is all you need to select the best action. You don't really need to do more than that. 
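For concreteness, a minimal sketch of this node data structure and the PUCT selection rule might look like the following. The field names, the update method, and the c_puct constant are illustrative choices, not the exact code from AlphaGo or the repo:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                 # P_a: policy prior for the move that led to this node
    visit_count: int = 0         # N_a: how often the parent has selected this child
    value_sum: float = 0.0       # running sum of backed-up values
    children: dict = field(default_factory=dict)   # move -> Node

    def q_value(self):
        # Q_a: mean action value, averaged over every simulation through this node
        return self.value_sum / self.visit_count if self.visit_count else 0.0

    def update(self, value):
        # Backup step: fold one new leaf evaluation into the running mean
        self.visit_count += 1
        self.value_sum += value

def select_child(node, c_puct=1.5):
    # PUCT: argmax_a [ Q(s, a) + c_puct * P_a * sqrt(N) / (1 + N_a) ]
    total_visits = sum(child.visit_count for child in node.children.values())
    def score(child):
        exploration = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.q_value() + exploration
    return max(node.children.items(), key=lambda item: score(item[1]))
```

The visit count stored here is the same quantity that later becomes the vote for which move to actually play at the root.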
But if you're interactively building this tree as  you're figuring out what the Q values should be, then occasionally you have to try some other  actions as an explore-versus-exploit trade-off. In both UCB and PUCT, there's  this term that basically rewards taking actions you haven't taken before. As we mentioned, each node stores the visit count of taking that specific action. Everything  is initialized to zero. For a given action, let's call it action a, initially it's zero. As n is increasing, if we've already made 10 action selections from that root  node but we haven't picked a yet, then this term starts to become quite large for a. Conversely, if we've chosen a 10 times out of 10, then this term is quite small. It diminishes  very quickly. The same thing is true here. Just to make sure I'm understanding, let me put it  in my own words. Let's focus on UCB. Conceptually, you can think of it as two different  things: the Q and the exploration term. Let's be clear about what Q is. Q is basically saying, once we do these rollouts—you're actually running all these  simulations—you go down the tree and figure out: if I end up at the terminal value of  this tree, do I win this game or not? You average whether I win or not across all the  leaves of this tree starting from this node. That average, you put in Q. So Q represents the probability that I'll win this game starting at this node. That is your sort of exploit. It's saying: I've run these simulations,  and I think this is a good move or not. The other term is saying: have I  explored this branch enough yet relative to the other actions I could  be exploring, or have already explored? If I haven't explored this branch  yet, maybe I think it has a low score, but I just haven't explored that many  leaves down this node in this tree. So I should try this even though Q, the exploit  term, is telling me it's not that valuable. Because ln(n) grows slower than n, over time  you will move from the argmax being dominated by the exploration term, which is the second term  here, to the argmax being dominated by the Q term, which is when you've done enough simulations and  are confident that this is the branch to go down. Yes, that's right. The motivation for UCB  was to come up with an algorithm where, if you don't know the payoff of the different  actions you can select, this strategy, given some exploration term here, bounds your  regret in terms of how wrong you can possibly be. I don't know the proof. I also don't know  if this one is proved to have logarithmically or square-root-bounded regret, but I think the  algorithm was derived to look something like this. You can tell these terms grow a little  differently, and this is to account for the fact that Go has many more actions for any given  move compared to your standard bandit problem. One small clarification: you talked  about simulations and probabilities. We should remember that Go  fundamentally is a deterministic game. Where does the notion of  probability come from here? If you had a very powerful computer,  there are no probabilities. You can just compute the true  average of the mean action value. So where does the probability come in? In computer Go before AlphaGo, we've always done some sort of Monte Carlo method  where we take the expected Q value averaged over a randomly selected tree. That randomly selected tree is where probabilities come in. The interpretation of Q is: what is the expected action value under the random distribution  induced by some random search process? 
Where does the random search process come in? That's where Pₐ, of action, comes in. If we assume a naive algorithm where you have a  uniform probability of taking any valid action, then this would just be one  over the number of valid moves. You would be taking this average  over a very diffuse tree. This is a valid integral, but it's very slow  because you're going to consider a lot of trees that have very low value. It's essentially almost like an importance sampling problem. Only a few actions and paths contribute high value, and almost everything else is low value. So that's a tricky problem here. This is the action selection criterion  for how you decide which moves to go down. As you move down in tree search, you  will eventually run into a node where it's quite clear you've won or lost. At the very end of the game, when there are no valid moves left to play under Tromp-Taylor  scoring, you can decide whether you won or lost. This is the final return of the whole game. We can assign a value, U, to a terminal leaf node of the tree, but how do we assign values  to the nodes prior to that, the parents? You take the mean action value,  which is essentially your average. Suppose these were all leaf nodes. The mean action value of this node is just the average of whether  you won or lost at the leaf nodes. Correspondingly, you can walk up the chain and say  the mean action value of this node—let's call it Qb—is just a weighted average of these ones here. The weighted average could depend on whether you have a different sampling distribution. But the basic intuition is that you want to resolve the game where you have a deterministic  win or lose, and then you can go backwards—this is called the backup step—and assign values  to these nodes or actions corresponding to the average over the final terminal leaf. If you were to do this without neural networks, it would still be intractable. You would have trouble finding which actions to sample. A lot of the actions would contribute very low value, especially if you're  trying to fight your way out of a losing position. Only a few actions give you high value, so the  search in practice is still very expensive. But the idea is that because Go follows a  tree structure, you can inform a very good estimate of the value of this node based  on the downstream values, assuming they're all correct and you've searched deep enough. Your earlier explanation—about the sorts of states where it's obvious to a human who's going  to win, but deterministically you still have to play it out—actually drew upon the intuition for  why 1) the value function is both trainable and 2) necessary in order to learn this game effectively. Maybe it's worth defining value in the first place. We talked about U being your final resolution of whether you won or lost. This is a terminal leaf node condition. Humans don't play all the way  to the leaves of the tree. They stop dozens of moves before, maybe even 100  moves before in high-level play. How do they know? You can think about humans as implicitly having  a neural network called a value function that takes in a board state and evaluates p(win). The human glances at the board and knows, "I'm probably going to lose." They're essentially running a neural network that looks at a board and implicitly  amortizes a huge number of possible game playouts. They take that average and decide whether the  board is winnable or not, and whether they should concede or keep playing. This is remarkable. 
If  you think about the beauty of something like this, a neural network in a human can somehow  do all of this simulation at a glance. They just know within a few seconds, without  actually playing every single game logically, based on crystallized knowledge and  experience. They can do this. This gives us a hint that in games like Go, there are  ways to radically speed up the search process. This is one of the fundamental  intuitions behind why AlphaGo works. You can train a value function to look at a board  and quickly resolve the game without playing out all of these trees to a very deep search depth. Makes sense. I will say for the audience, for previous episodes when I was prepping and  it seemed relevant to understand how AlphaGo works, I would find it very confusing. But it's the kind of thing where once you understand the problem in this way and then build  the next few pieces, it is actually a lot more understandable and it makes a lot of sense. It's okay to be confused right now, but it's probably simpler to understand, by  the end of this lecture, than you anticipate. I’ll just make that note for the audience. The important intuition at a high level—to step back about where we're going with all  this—is that classically, for games like Go, you could build a tree, but we don't  have computers powerful enough for that. Estimating the value of every action you  could possibly take is also hard because you don't know until the end of the game. You could take averages by playing them to the end, but that's also hard because you don't know  which actions to take to sample these averages. Conceptually, there are two problems: the  breadth of the tree and the depth of the tree. AlphaGo gives us a way to shrink  both of those to be tractable. That's essentially the core idea behind it. We take the idea that humans can glance at a board and instantly predict whether we win. That maybe gives us the opportunity to truncate how deep we search. We also know that humans can intuitively, at a glance, decide what  moves might be good on a Go board. These are two things we can use deep neural  networks for, to accelerate the search process. Before we talk about neural nets, let's  go back to how this playout works. We've only talked about making one move. The AI looks at this encoded Go board. It has a tree. It searches deeply into the tree to  find out which of its actions might be the best, takes that action, and goes back to the human. Now the human sees a Go board that looks like this, and they make their move. Maybe they put their stone here. Now we go back to the AI, which  now looks at a new encoded board. I've used 2 to denote the AI  playing as white, 1 to denote the human playing as black, and 0 as empty. On the AI's turn, it does the MCTS tree search all over again from scratch. It throws away the old tree that it searched last round. Now there's a new root node and it begins to search anew. You can basically think about MCTS as a search algorithm that decides  which moves to play best, aided by neural networks, and it's done on every move. Let's talk about the neural network part of this. While you're erasing, another thing that was  important for me to understand was about the MCTS data structure with nodes and children of nodes. This is done per move and reinstantiated once a move is made. A human makes a move, then the AI looks at this and runs a bunch of  simulations to figure out what move to make next. A simulation is exploring one  more node in the MCTS tree. 
Once you run 1,000 simulations, that informs  the probability of what move to make next, as you'll explain. That’s what you store. You  choose the best move given those probabilities. You discard all of that, the next player  makes a move, and you restart this process at the beginning of every move. Correct. One small addendum: you don't discard all of that. You keep one thing behind that we'll use later. Just like I did for Reiner, I wanted to  make flashcards for this episode so that people could retain these concepts. And ideally, an LLM could generate some candidates for me to then refine. But to actually get high quality suggestions, I needed to design a whole pipeline where  the AI could take and ingest screenshots of the blackboard at the right timestamps and then  make SVG diagrams in case visuals were helpful, and then run the writing and drawing through  a critic and then revise the card in response to this feedback. It's very hard to accomplish this just by stacking LLM calls. This sort of step-by-step recipe works much better if you have a durable agent that's been engaging  with the task across all the previous stages. So I used the Cursor SDK to  spin up an agent for each card. The Cursor harness saved me a bunch of work  in designing some custom context scaffold or figuring out how to design tool calls for  taking screenshots or making animations. These agents all run in the cloud, so I don't  have to worry about leaving my laptop open. I just get an email when I  have candidates to review. You can check out my cards  at flashcards.dwarkesh.com. You can start building with the  agents SDK at cursor.com/dwarkesh. Now that we have a basic intuition of how  moves are made with search, we're going to talk about how neural networks can speed this  up by providing an analog to human intuition. There are two networks. There is the value network, which takes in a state and predicts, am I going to win or lose? It's a binary classification problem. Then we have a policy network, which induces  a distribution over good actions to take. I'm going to draw a one-dimensional flattened move  distribution, but this is really a square grid. These are the probability  distributions over good actions. Both of these are categorical  classification problems. You can train this like any classifier with deep  learning, cross-entropy loss, that kind of stuff. The specific architecture  does not matter too much. I tried a few different architectures.  Transformers work, ResNets work. For small data regimes, my experience is that ResNets  still outperform transformers and give you more bang for the buck at lower budgets. But this may not always be true. Wait, why is that? They provide the inductive bias of local convolutions. Generally, transformers start to outperform residual convolutional networks  when you want more global context. One interesting finding from the KataGo paper was  that they found it quite useful to pool together and aggregate global features throughout  the network, to give the network a global sense of how to connect value from  one side of the board to the other. What does it mean to aggregate global features? If you have a very large 19x19 Go board, and you've got some battles going on here and  some battles going on there, when you pass this through a convolutional neural network, the  receptive fields are going to be good at computing local things and making that invariant. But they won't be able to connect these two features easily. 
They need to be pooled together and attend to each other somehow. The argument for why transformers are good for computer vision tasks—vision transformers  and so forth—is that because they have global attention across the whole thing, they  can more easily draw these connections. But you do need more data there so that you can  learn the invariant local features through data. I've tried very hard to make transformers work for  this problem because I was curious if transformers would present some sort of breakthrough in  Go and just remove a lot of those tricks. But try as I might, I haven't figured out a way  to make transformers better than ResNets for now. One more tangential question. It makes sense  why transformers, with their global pooling of information, would be better if you need to  consider information that is not just spatially… CNNs give you a bias that the things  next to you are especially relevant. And then they're aggregated up slowly. For games where it isn't that relevant what is happening locally—you just  have to consider the whole thing—you're saying transformers would work better. We’re talking about the spatial dimension. How about the temporal dimension? Right now  we're only considering the previous move because it is a deterministic full-information game. What if it were something like poker or Diplomacy, where a bluff they made a while back  is relevant to understanding now? And to decide your next move, you need  to consider all those previous states. Would that change the  consideration of what inductive bias and which architecture is most relevant? Great question. Go is a perfect information game. In perfect information games, there does  exist a Nash equilibrium strategy for which you can do no worse than any other strategy. If you know your opponent has a particular bias, like they love to play aggressively, you can  in principle counter that specific strategy better than a Nash equilibrium policy. But to counter any given strategy, there exists a single Nash equilibrium that  can be decided solely using the current state. That is a design choice AlphaGo made, which in  hindsight turned out to work very well because the Nash equilibrium seems to be superhuman. No human strategy seems to be able to beat it. There are variations of this where you  would need to consider temporal history. This is a very exciting research area, and I'd  encourage people to fork my repo and try it out. If you were to play 2v2 Go, you actually  need to model your partner's behavior. You may not have information on how they play,  so you need to aggregate information on how they play so you can respond accordingly. These are situations where it's no longer a perfect information game. In games of imperfect information or partial observability, you do  need context to build a model. That's a place where things can get very exciting  in terms of self-play or Diplomacy-style games. Returning to the neural network, the  architecture again is not super important. You can get it to work with transformers. You can get it to work with ResNets. I found that for low-budget experiments,  ResNets work a little better. You can also use Karpathy-style  AutoResearch hyperparameter tuning to make your architecture pretty good. You don't have to worry too much about that. You just need to set up the problem so  that you have a target optimization. We're going to pick a somewhat arbitrary  architecture that worked for what I did. But again, this part is not super important. 
You have your encoded board state and, similar to an RGB, we're going to have three channels. One channel to encode black, one to encode white, and one to encode empties or a masked region  if you want to train on multiple board sizes. I'm actually not going to talk about multiple  board sizes for now. That's a bit too complicated. We have this two- or three-channel  RGB-like image, and we feed it into a ResNet. Then we have two branching heads. One head predicts the value function, which is a single logit, let's just call it R¹. Then we have the policy, which is R³⁶¹. This is the architecture. We're going to  train this to predict the outcomes of games given the board state, and we're also going  to train it to predict what are good moves. The original AlphaGo paper, called AlphaGo  Lee, initialized this network with a supervised learning dataset of expert human play. Later, they removed this restriction by having the model teach itself how to play well. I find it super nice for implementation, for your audience, to always initialize  your experiments to something easy and get the problem working before trying to bite  off the whole thing and learn tabula rasa. In deep learning, initialization is everything. You always want to initialize your research project to something as close to success  as possible, especially if you're doing something new that you haven't done before. Always pick something that works and then get it to do something better,  rather than start from something that doesn't work at all and try to make it work. Under that philosophy, it's a great idea to start with something that has a good initialization. We're going to take human expert plays and train this model to predict good actions. We're going to take all the moves in which an expert won and predict those actions. Regardless of board state, whether you won or lost, you're going to predict the outcome. You might be wondering, for some of the early boards where only one stone  has been put down, how could you possibly know who the winner of this game is? Well, if you have hundreds of thousands of games, on average you'll probably see that boards  starting like this have half of the games branch off and win, and half branch off and  lose. So that'll actually be fine. When you train this model to predict those outcomes,  the logit will sort of converge to 0.5. For these things, it's expected that once you  train the model, a starting board state will look like 0.5, and then as you progress towards  the end of the game, the win probability will either go up or down. This is your move number.  As you get hundreds of steps into the game, it becomes much clearer who's more likely to  win or lose under your expert data distribution. I didn't understand the significance of  why this way of thinking about value is especially relevant to the expert data. It is not relevant to the expert data. It's true for any data that you trained it on. If you were to learn tabula rasa, you would also expect this to fall out. Imagine you're vibe  coding AlphaGo. You gather some expert datasets from KataGo online, or you have a dataset  of human players, and you train this model. It turns out this model is  already a pretty good Go player. It will most likely beat most human players. If you just take this policy recommendation and take the argmax over these probabilities—if  you take the argmax and just take this action as your Go play—it'll be a very fast Go player  that doesn't think in terms of reasoning steps. 
It just shoots from the hip, and  it'll be a very strong Go player. This is already quite miraculous if you  think about how ten neural network layers, maybe under 3 million parameters, can already  do something that impressive. You can start this way. It's important when implementing  this to verify that this is probably true. It's good to verify that your Go rules  are implemented correctly and that you can run these simulations relatively quickly. Just as a checkpoint, you want to make sure you can actually do this basic step before you  try to layer on more complex things like search. But we can do a lot better than taking the  raw neural network and playing the moves. So let’s apply the neural network  to improve Monte Carlo tree search. We start with our root node, and we now have  a four-step iterative process to do MCTS. This tripped me up when I was first reading  the paper and trying to understand it. Essentially what we're going to do is choose a  number of simulations, and this number varies. It can be somewhere between 200 and 2,048. I believe in the AlphaGo vs. Lee match, they used tens of thousands of simulations  per move because they really wanted to boost the strength of the model as much as possible. But in training, you don't actually need too many. KataGo, I think, uses something  in this order as well. Do you know if they used… If  you watch the documentary, they had a laptop out during the game. They didn't use the laptop itself. It was on some— It was on some TPU pod, I think. Honestly, kind of unfair. Lee is not using 1E22 FLOPS to do a move. Fair enough. Interestingly, modern Go bots don't need that much compute at test time. What we'll find out, as we talk about how the MCTS policy improvement works, is that  over time the raw network actually takes all of the burden of that big TPU pod  and just pushes it into the network. You can do all of that work with  one neural network forward pass. The TPU pod will always add the extra oomph on  top, and that's what they wanted for the match. We're going to pick this number of simulations  thing, and for every simulation, we're going to do several things simultaneously. We're going to see which moves are the best in the current tree. We're going to add extra leaves to the tree if we get to a point where  we need to add a leaf, and we're going to update the action values for the tree. Every simulation involves this four-step process: selection, expansion, evaluation, and backup. At the beginning of our Monte Carlo tree search, our tree is very basic. It only has the root node, or our current board that our AI wants to play at. We're basically going to select the best action for this. When this root node is created, we also know that we can evaluate it under  our neural network and get the quantities Vθ, as well as our probability over actions. And I’m going to say root. For all of the actions here, we  can create a bunch of children. In this case, I'm drawing a 3x3 board with  one missing, so there are eight possible children associated with this root node. Each of these has an associated probability of taking that action, so there's P₈, P₁, ...  P₂, and so on. So at the beginning of our Monte Carlo tree search, we have our root node,  and we can initialize it with some children. The policy network evaluated on  the root node gives us—on a 3x3 board with one existing stone placed—eight  possible children that this AI could take. 
With each of the children, the policy network also gives us the probability of selecting that child. The first step is to do the selection of the tree. Again, this is a very shallow tree. All we have so far is essentially a tree of depth one. Our first move is to select by maximizing, or argmaxing, the PUCT criterion, which is basically Q(s,a) + c_PUCT · Pₐ · √N / (1 + Nₐ). For each of these, Nₐ is zero for all the actions initially. N is zero. So we're going to pick according to this. Initially, the chosen action is most likely going to be biased towards the highest-likelihood action, because the Q values are all initialized to the same value, so the prior Pₐ decides. Let's suppose P₁ was the highest-probability node, so you selected this one here. Now you get to this node and you realize it's not a leaf node. It's not a terminal state, so you cannot resolve the final outcome. The next step is expansion. You will then run this board state through the policy network. Note that this is the AI's move. The AI is making this move. When we expand this tree, we're now thinking about what the human, or any opponent, might do. When we evaluate the node here, we're going to evaluate it from the perspective of this player. This node has possible actions that we could take, and we expand the leaf nodes here. For each of these nodes that we could arrive at, we're going to now check how good they are. From here, the human could play here, here, or here. We're going to store Vθ for each of these, so Vθ of Node₁’, Vθ of Node₁’’, and so on. We're basically using our neural network to make an intuitive guess of how good this board is from the perspective of this player. Fortunately, because it's a zero-sum game, it's easy to deduce that the value for this player at this step is just one minus the value from the other perspective. It's easy to flip the search process depending on which player you are. This is the expansion step. You've taken a non-leaf node, expanded it, and evaluated the value. This is essentially a quick guess as to whether I'm going to win or not if I were to play to the end. You can almost think about Vθ as a shortcut for searching to the end of the tree, for any given simulation. This is essentially the evaluation step. We're evaluating the quality of each of these boards. In the original AlphaGo Lee, they did something kind of interesting. They took this value and averaged it with the value of a real Go playout. They actually played a real game from here all the way to the end. I'm just going to draw this squiggly line to indicate some path. They play this all the way to Tromp-Taylor resolution of a full board. So this is a zero or one. They took this value and averaged it with this one here. The formula was α · Vθ of some node plus (1 − α) · z, where z is the outcome of a truly randomly sampled playout. You might be wondering how they play this out. It would be very costly to do another search on this playout, almost like a tree within a tree. They don't do this. Instead, they just take the policy network and play it against itself. They use it as both players, and they just play it all the way to the end. This helps ground the estimates here in reality because you can get a single-sample estimate of whether you win or not. In the endgame, where the board is almost resolved, this becomes quite useful because the play according to the policy will most likely decide a pretty reasonable guess of the game, so you're not facing a problem where this becomes untethered from reality.
It turns out this is totally unnecessary. In all subsequent papers after  AlphaGo Lee, they got rid of it. In my implementation, I did the same, and it  speeds things up a lot because you don't have to roll these games out on every single simulation. Okay, just to reinforce my own understanding and re-explain it. For the audience, by the way, in case it's not obvious, the P there in the select, that is the  probability coming from the network in this case. Correct. The policy network here. Fundamentally, a simulation, just think of it as rolling out one more node in the search process. Almost. A simulation is easy to think about when the whole tree already exists. You just walk down the tree using the PUCT selection criterion, and then you keep going. In AlphaGo, the data structure is such that we begin with a tree that basically only has depth  one—its only children—and you want to iteratively build out the tree as you're also selecting  actions down the tree. That's the core thing here. Because Go is such a combinatorially  complex game, you cannot afford to build the tree in advance and then search it. You must search while building the tree. Let me finish up with the last  step, which is the backup. Once you've scored these things, the Q value  assigned to the node here for taking this action is just the average across your evaluated values. You take a running mean over all the simulations you've taken, averaging the  values of the children nodes. That's the backup step, and once you evaluate  this, you can recursively go back up. If you know the action value of this node, you can  then take the average on its parent, and so on. You have this four-step process: 1) You're  choosing the best action you know of so far. 2) You may run into a node you haven't been to  before, so you need to grow the tree a bit. 3) You run it through the network to guess whether  you're going to win or not. 4) You walk all the way back up to the root node to update  your values on what the best moves are. As you do this iteratively, because you're  always selecting according to this criterion, you're always going to be selecting the  best action you think at any given branch. The final visit counts of how often  you chose these things will reflect your correct policy distribution as  induced through this search process. The visit count we stored in the node earlier  actually becomes the vote for which action we should finally select. As a test of understanding, it's worth thinking a little bit about  whether we could make this even simpler. Could we actually maybe even get rid of  this one and still make the thing work? Recall that when you do an expansion and then  an evaluation at this node, you are checking the win probability of each of the child nodes. So if this one is one and these are zero, you do know something about which  action might be better to take. Why would you still need this, right? Why not just normalize the values into a distribution and call that your  policy distribution? This is fine. You can do this, and it probably does work. But in practice, having a single forward pass that gives you a pretty good guess is how the  breadth is pruned out. There is a duality here. It would be weird if the policy recommended  an action that disagreed with the value. If a policy said this was very high probability,  but this one said it was a low value, then there's something fundamentally wrong  between your policy head and your value head. 
They are linked, and you probably could get rid of this if you came up with a different way to recover it from just the value evaluations. Right. Just to make sure I understand, the reason you don't do that is so you don't have to do 360 independent forward passes to say, "Okay, here's the value of everything. Let's argmax over it." Instead, you can just do one forward pass and get the probabilities of all of them. You can usually batch these somewhat efficiently, so it probably isn't a huge computational burden in practice. But yes, you would have to pass up to 361 boards into a single mini-batch update to evaluate all the values, then normalize them. There's actually a more important reason why we still do this, which is how Monte Carlo tree search is used to feed back on itself and recursively improve its own predictions and search capabilities. That's where having this as an explicit entity you're modeling, rather than an implicit normalization over your value, is a good idea. Makes sense. Okay. We talked about the simulations. What you end up with as you roll out the number of simulations is a tree that looks like… I'm drawing a very low-dimensional version of this. Of course, in the real game, it's much more high-dimensional. You'll end up with a tree structure that has a lot of leaves that terminate and are not visited again because their value is deemed to be too low. But along one path, there will be a set of actions with very high visit counts that gravitate towards that one set of decisions as you increase N. This is the mental picture of what the tree in Monte Carlo tree search looks like. You should contrast this with an exhaustive tree like in tic-tac-toe, where there are nine actions, then eight, then seven and six, and so it's a nine-factorial-sized tree. The Monte Carlo tree search in Go is very sparse. It only considers the paths along which you've expanded child nodes. Now that we have the search algorithm that applies the value function as well as the policy function, we can talk about how the Monte Carlo tree search algorithm can act as an improvement operator on top of these guys here. Twenty years ago, Jane Street's data center fit in the corner of an office. Ron Minsky, who co-leads the tech group there, told me about how it all got started. One of our compute clusters we called The Hive, and I remember the first version of the Hive was literally like six Dell boxes stacked on top of each other at the end of the row. And the trading systems themselves we also had there because we actually wanted the ability to make sure we could turn the damn thing off. I mean there were ups and downs, like literally at some point you know, one of the people who was cleaning the office unplugged one of the trading systems in the middle of the day as they were vacuuming. So, you know, in the end it is in fact better to have it all in a data center. Jane Street's data centers have come a long way since those six Dells, and I got to tour one of them in Texas with Ron and Dan Pontecorvo, who leads Jane Street's physical engineering team. You know, these cabinets, these GB300 cabinets, consume at peak about 140 kW each. Compare that to traditional air-cooled racks, where you're talking about 10 to 40 kW? It's a lot more. We got deep into the details of running one of these data centers. Things that I had never considered before. It's filled with a liquid, a mix of deionized water and propylene glycol, 25% propylene glycol. That's to inhibit any bacteria or algae growth.
I don't love the world where we have to worry about bacteria growing in our servers. I got to see way more of what actually happens in a data center than I've ever seen before. Jane Street was willing to literally pull up the floorboards and take out the racks and take me to the back where all the chillers are. You can check all of this out at JaneStreet.com/dwarkesh where we posted the full tour. We now talk about the RL part of how this thing gets stronger by playing itself. Let's say we play a game. You make a move. The AI will compute the search, and this is the visit count distribution. Let's say this is your initial policy recommendation at this node. After MCTS, it gets more confident about one of these actions. Maybe the distribution looks a bit more peaky like this based on the search. Of course, you can tune the search process so that it ends up more diffuse, but that's probably not a good idea. MCTS should get more confident about specific actions over others. It might place a lot of weight on other actions initially, and then as you increase the number of sims, it should converge to a very peaky distribution. Let's call this π. Let's wrap this in an MCTS operator, MCTS(a | s). After applying the MCTS process, your policy-recommended distribution looks like this. It's a bit more peaky than the previous one. Then you take the argmax, or maybe you just sample from this. It doesn't have to be the argmax. You make your move, throw away the tree, and begin anew on the next move. Again, you compute a new distribution. Initially, maybe your guess looks like this, and then you refine it through MCTS. There should be one more X on the board, right? I'm sorry, that's correct, yes. Something that looks like this. On every move, you have your initial guess from your policy network. Then the search process, which combines your policy network and your value network, arrives at a more confident action that you take, and so on and so forth. Then the game ends, and one person wins and one person loses. The beauty of how AlphaGo trains itself is that it can actually take this final search process—the outcome of the search process—and tell the policy network, "Hey, instead of having MCTS do all this legwork to arrive here, why don't you just predict that from the get-go? Why don't you not use this guess and predict this to begin with?" If you have this guess to begin with in your policy network, then MCTS has to do a lot less work to get things to work. If we draw a test-time scaling plot… Let's say this is the number of simulations. At zero simulations, your implicit win rate is here. If you just take this raw action, this is what your win rate is. Let's say we increase the number of sims, maybe you have a win rate curve that looks like this. When you search for 1,000 simulation steps, that gets you to a policy here that gets you to here, which is great. But if you were to distill this MCTS policy network back into your shoot-from-the-hip policy network, then you could actually start here. Let's say this was at zero; by distillation, if you then spend another 1,000 sim steps, you actually get to here. It's almost as if you amortize the first 1,000 steps into the policy network instead of the search process; then you can begin at a much better starting point and get a much better result for the number of sims that you play. There's a sigmoid-type nature to test-time scaling: as the number of simulations increases, the increase in win rate gets smaller.
Is that true even for the distilled network? That is to say, is there some gain where we start from the distilled network  and get these early gains again, or is that just inherent to the nature of MCTS? To be honest, I actually don't know the test-time scaling behavior of MCTS simulations. I believe it might be quite sensitive to how strong this one is in practice. I'm just drawing a monotonically increasing function that gets to one. Don't pay too much attention to the shape of the curve. Just know that it's monotonic with respect to sims. The idea of MCTS is very brilliant. We got something better by applying search. Now, on our next iteration of updating this network, we're going to train it to approximate  the outcome of 1,000 steps of search. Instead of starting here, we get to have  our neural network start here, and then the play gets stronger once we then apply another  1,000 steps on top of it. You can keep going. The training algorithm for AlphaGo is to basically  take the games where you've applied the search on every move that the policy encountered—whether you  won or lost, and that's quite important—and just train the model to imitate the search process. There's an analogy to robotics, which is the DAgger algorithm. First I'm going to draw a schematic of the states: S₀, S₁, S₂, S₃. Let's say we took a series of actions in an MDP to get a trajectory. These actions may be  suboptimal. Maybe we lost at the end of this game. There is a family of algorithms that  basically take trajectories and relabel the actions to better trajectories. Maybe a better action here would have been to take A₀’. A better action here would have been A₁’, and yet another one, A₂’, A₃’. What MCTS is doing is saying: you played this game where you eventually lost, but on every  single action, I'm going to give you a strictly better action that you should have taken instead. It does not guarantee that you are going to win, but it does guarantee that if you take these  tuples as training data and retrain your policy network to predict these actions instead of  the original ones, you're going to do better. This is very related to DAgger in  robotics and imitation learning, where you want to collect an intervention here. Even if you're in a not-great state—for example, a self-driving car that veers off the  side of the road—there is still a valid action that corrects you and brings you back. Pedantic question, but is there a guarantee that MCTS must be better than the policy? For example, you could imagine early on in training, because MCTS is informed by the value  network, that when the value network hasn't been well-trained on finished games, MCTS is  worse than a randomly initialized policy. Is it just a heuristic that MCTS is better  than the policy, or is there some guarantee? In practice, it is a heuristic. It does work in practice, but let me illustrate an example where MCTS can give you  a worse distribution than your policy network. This can often happen if your self-play  algorithm has trained to a good point, but then somehow it collapses because it's  not trained on diverse data or something. Let's say we have a board state where  the policy recommendations are very good. So Πθ (a | s) is great. But somehow, maybe because we're playing a lot of games where the bots just  resign instead of playing all the way to the Tromp-Taylor resolution, they forget  how to evaluate those late-stage plans. 
Like in the case we showed with the corner  play, maybe 100% of our training data in our replay buffer has lost examples of how to  evaluate the value function at those states. You might end up in a scenario where  your terminal value is very bad. If the terminal values of the leaves are  not good, then this will propagate all the way up and cause your PUCT selection  criteria and your backups to be off. And then you end up visiting a very  different distribution than what your policy initially recommended. Also, if your number of sims is low, you might have a variance issue  where you just don't explore enough. It's only guaranteed to converge  when you take N to infinity. Variance in your search process  as well as inaccuracies in your evaluation can definitely screw with the  quality of your policy recommendation. That's why it's not a guaranteed improvement. That is why I suspect AlphaGo Lee had the playouts to the end in their training algorithm, so that  they could ground this thing in real playouts. In practice, what you could also do is,  for 10% of the games, prevent the bots from resigning and just say, "Resolve it to the end." That way you get some training data in your replay buffer to really resolve those late-stage playouts  that normal human players would not play to. If you assume that the  value functions are correct, this is why MCTS gives you a better policy. It's a very critical chain of assumptions. Assuming this is accurate, your search  process should give you a better recommendation than your initial guess. If you have a cold-started policy—if you have an AlphaZero-type thing—really  what's happening for the first few epochs is that the policy is kind of useless. What you're really just doing is, "Hey, let's play full games, and once we have  played full games, for the preceding moves, we'll have labeled who won and who didn't win." The loss for AlphaZero has two components: how good is the policy relative to MCTS, and  how good is the value prediction relative to who actually won the game from this move. You can think of this being applied to every single action or every single move. Really what's happening at the beginning of AlphaZero training is just that we're trying  to get the value function to actually predict who will win the game if you find yourself in  this state and you're this player. Functionally, that's all that's happening. Later on, once that's  well trained, now the policy is also improving. Correct. There’s one trick  I found to be pretty useful. This is not a peer-reviewed claim,  so take it with a grain of salt. I found it useful in my own  implementation to do the following. You want to first make sure that this is good  before you invest a lot of cycles doing MCTS. It doesn't really make a lot of sense to  do search on garbage value predictions. You want to start at a good  place where this works. AlphaGo Lee does a very good thing where it just  takes human games, you train on them, and it just works. It totally works. You can also take an  open-source Go bot, play it against itself, generate data, and that also works. If you have some offline dataset that has realistic, good play, you can easily learn  the late-stage value functions pretty well. That's what you need to start the search process. Sorry, can you just repeat that sentence one more time? Sure. It's quite easy to evaluate a late-stage Go game. 
When almost all the pieces are on the board, it's almost like a decidable  problem because there's lower and lower uncertainty as to the depth of the tree. So most games played to the end by reasonable people will be good training data to train a good  value function at terminal parts of the tree. Then, as you play more games,  the search will back up good values into the intermediate nodes of the tree. As you increase the amount of data, your value head gets a good intuition of what is a healthy  board state versus a not-healthy board state. Those are much more subtle to judge in the  mid-game than at the beginning or the end. The most difficult part to score is not the  beginning, because the beginning is obviously 0.5, and at the end it's pretty obvious who's winning. The hard part that you want to learn in the value function is who is winning in the middle. So this is actually very analogous to TD learning. Yes. There's a beautiful connection to TD  learning that we can talk about in a bit, contrasting it with Monte Carlo tree search. You first want to get good value functions, and expert data can give you a quick shortcut. I recommend for practitioners, just do that first to initialize to a good starting point. If you want to do the AlphaZero thing or KataGo tabula rasa learning, what you can  try to do is play random games on a small board. Just take a random agent. If you play  50,000 games, you'll actually learn a pretty good value function as well. On a 9x9 board, you can see enough of the common patterns with random play. If you train on both 9x9 and 19x19 data—and KataGo proposed one of these architectures—there's  pretty good transfer learning from the value head evaluated at 9x9 to 19x19. Right, because this, unlike other games, has very much a sense… There's not  like a new kind of piece introduced when you increase the size or something. If we take it to its limit and consider a very tiny 4x4 Go board, if you play 50,000  games, you're going to have a lot of end states that look like human play. It's just Tic-Tac-Toe at that point. If you broaden this a little bit to 5x5 or  9x9, it's not unrealistic to imagine that purely random play will generate  pretty reasonable-looking boards. You can score those pretty easily. That is what gives you the bootstrapping to then improve your policy with search. But it's very critical that MCTS has accurate value estimates. You need to ground the value. Ultimately, MCTS will fall apart if you don't  have a grounding function for the value. I'd be curious how much compute you save by  training the value and policy on the same network. Because they share the same representations,  how much more efficient is learning? We've just talked about how they're  making similar predictions, or they should be in line with each other. So I'd be curious if you're halving the amount of compute you have to do  by keeping them in the same network. AlphaGo Lee, the original AlphaGo  paper, had two separate networks. In all subsequent papers, they merged them into  two heads, and presumably this saves compute. But answering that question in  a rigorous scientific way is a simple question that in practice, if you really  want to chase the question down to its limit, takes quite a bit of work to really resolve. Intuitively, yes, they share a lot of representations. As we mentioned, your policy network and your value network,  when doing evaluations, should agree. There really should be this  consistency between them. 
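As a minimal sketch of the two-headed recipe being described, here is what a shared-trunk network and its supervised training step on relabeled self-play data might look like in PyTorch. The layer sizes, number of input planes, and the shape of the training tuples are illustrative only, not the actual AlphaGo or KataGo architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Shared convolutional trunk with separate policy and value heads."""

    def __init__(self, board_size=19, channels=64, in_planes=17):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        # Policy head: one logit per intersection plus one for "pass".
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board_size**2, board_size**2 + 1),
        )
        # Value head: a single scalar in [-1, 1] for the expected outcome.
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board_size**2, 1), nn.Tanh(),
        )

    def forward(self, x):
        h = self.trunk(x)                 # one forward pass feeds both heads
        return self.policy_head(h), self.value_head(h)

def train_step(net, optimizer, states, mcts_pis, outcomes):
    """One supervised step on relabeled self-play positions.

    states:   [B, in_planes, 19, 19] board encodings
    mcts_pis: [B, 362] normalized MCTS visit distributions (the improved labels)
    outcomes: [B] game results (+1 / -1) from the player-to-move's perspective
    """
    logits, values = net(states)
    policy_loss = -(mcts_pis * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    value_loss = F.mse_loss(values.squeeze(1), outcomes)
    loss = policy_loss + value_loss       # weight decay lives in the optimizer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the dense layers in these heads fix the board size; the KataGo-style transfer between 9x9 and 19x19 mentioned above needs heads without size-dependent dense layers.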
Tell me if this is the wrong way to think about it. When I learn how an LLM works and how simple RLVR is as an algorithm, I'm stunned by the kinds of things it can do. It can learn how to build very complicated code repositories simply from getting a yes or no. Here, once you understand it more deeply—it's just predicting MCTS—AlphaGo seems less impressive in retrospect. You're putting in a lot of bias by telling it how it should titrate exploration as things go on. You're building this very explicit tree search for it. I don't know if you share that intuition where the more you understand it, the less impressive the accomplishment in 2017 seems. I personally disagree. I think they're profound for different reasons. I don't understand the LLM RL enough to comment on it on your podcast. But why is AlphaGo a profound accomplishment? It's worth stepping back a little bit. It is different from modern RL, and we can talk a little bit about some of the algorithmic choices there. I think the most profound thing here is that a 10-layer neural network pass, basically 10 steps of reasoning… Of course, the reasoning is not just one train of thought. It could be distributed representations and a lot of thoughts going on at the same time. But by construction, let's say a 10-layer neural network can only do 10 sequential steps of thinking. 10 steps of neural network parallelized distributed-representation thinking is able to amortize and approximate to very high fidelity a nearly intractable search problem. This was a breakthrough that I think most people don't even fully comprehend today, how profound that accomplishment is. This is what also undergirds AlphaFold, for example, where you have a very, very difficult physical simulation process where you would need to roll out so many microscale simulations, and yet 10 steps of a somewhat small neural network can somehow capture what feels like an NP-class problem in a single forward pass. It actually makes me wonder if our understanding of questions like P versus NP, or these fundamental computational hardness problems, is incomplete. Obviously, it's not a proof of P=NP, but there's something to it that is very disturbing, where what felt like a very hard problem can fall to a very simple macroscopic solution. That is a very interesting insight, that a lot of problems which are proven to be NP-hard—I don't know if Go is proven to be NP-hard, but protein folding, et cetera—neural networks can solve. They're NP-hard in the worst case, but we're usually not concerned about the worst case. These problems usually have a lot of structure to them. I think the question we should be asking ourselves is about how we've been formulating solutions to NP-hard problems in worst-case complexity. I wouldn't say this solves Go. It doesn't give us the exact optimum, but in practice, it is extremely useful. The same thing has been shown in AlphaTensor and AlphaFold. Yes, there is a very hard problem that, in the worst case, seems intractable, and yet we're able to make almost arbitrary amounts of progress. In the limit, what might this look like? If you want to simulate something very complex like weather, or predict the future—do we live in a simulation or not—the computing resources you need to build a very complex simulation might be much smaller than you think, based on our ability to amortize a lot of that computation into the forward pass of a single network.
To me, AlphaGo was the first paper that really showed this profound level of simulation being compressed into a small amount of compute. I feel totally not qualified on the computational complexity or the math to comment on this, but I wonder if there's an important role of chaos here. What is the problem with weather, and why does it take 10x the amount of resources to predict weather a day out, and continually so for every additional day out? It's because it's a chaotic system, so small perturbations can totally change the final estimate as time goes on. I guess you would expect that for Go and protein folding as well. Here's an analogy to weather that might be relevant in Go. Here's our current board state. Given what we know about both players, what is the exact board state in the future? This is extremely sensitive to initial conditions. A single stone placed here can disrupt the entire prediction. This is hard. Intuitively, this is the chaotic problem. Yet somehow, we can predict who's going to win. This captures a lot of possibilities. There's a more macroscopic quantity that we really care about, which is the average or expectation or some sort of global macrostructure over a lot of possible futures. In weather, it could be the same thing. We don't exactly care what the velocity of wind 6,000 feet above a specific latitude and longitude is. We care where the hurricane is, and things like that. In chaos, there's the classic Lorenz attractor. If you start anywhere on the Lorenz attractor, you don't know where you're going to end up, but you do know that the thing looks like this. There's a kind of beauty here. Sometimes we don't necessarily care about the microscale things. We actually care about the macroscopic structure, and these things can be predictable. Contrast that to something like a hash function, which is also incredibly dependent on initial conditions but doesn't have a macrostructure, or at least hopefully doesn't, if the algorithms work. One would hope. There's no equivalent of a value function, or broadly of how the weather is going to be, that's interesting there. It's really just about what the board is going to look like exactly 100 moves from now. Intuitively, that seems correct. This is also out of my area of expertise, but I find it interesting that the tools of cryptography and hashing have also not been able to prove that you cannot come up with fast approximations. You cannot come up with fast approximations. If they were able to do that, then you could prove P is not equal to NP. In fact, we know that there's structure in many cryptographic protocols, obviously RSA cryptography. There is structure, and that structure is what quantum computers exploit to break them. Reiner has a very interesting blog post, which we've talked about in the episode, where he points out that if you look at a high level at what cryptographic protocols look like and what neural networks look like, they're extremely similar. You have sequential layers of jumbling information together. There's a convergent evolution in the algorithms. In cryptography, you want the final state to be incredibly sensitive to initial conditions, so that it comes out looking jumbled if you change anything. In neural networks, you similarly want everything to be dependent on all the information, because you want to process all the information and consider how it relates to itself. You have the maximum power of a neural network at the edge of chaos.
I think there are some research papers  from Jascha Sohl-Dickstein on this. There's something quite fundamental about  chaos that's not just hopeless noise. There's something useful in chaotic  systems, at least at that boundary. But this is just thinking  about it as a philosophy. I don't know the math well  enough to comment on it. We’ll talk about LLM RL in a little bit,  because there’s some connection there, but let's go back to MCTS. What is it doing?  Crucially, it is not saying that we're going to increase the probability of winning directly. It's not going to upweight all actions that won and downweight all actions that didn't win. Importantly, what it is doing is saying: for every action we took, we did a pretty  exhaustive search on MCTS to see if we could do better, and we're going to make every  action that we took better by having the policy network predict that outcome instead. This is a very nice idea because you have one supervision target for every single action. The variance of your learning signal is very low compared to the alternative naive RL thing. Let's consider a very naive algorithm that looks a lot more like modern LLM RL today, where we  do something like take the winner of a self-play game and encourage it to do more of that. It's worth thinking a little bit about what alternatives we could use to train  self-play agents instead of MCTS. We use a lot of LLM-style RL these days. Is that  relevant? Could we do that instead? Suppose we have a very naive algorithm where we  take a league of agents of different checkpoints and play them against each other. For the games where a single player wins, we reinforce those actions up and retrain  the policy network to imitate those winners, instead of the MCTS objective. Let's say you have a chain of actions that led to a win, and you have a matchup  between two agents that are basically the same. Assume policy a (Πₐ) and policy b (Πb) are  evenly matched, so their true win rate is 50%. Let's say you play 100 games,  and each game lasts 300 moves. You're doing some sort of evolution  strategy or some way to perturb these things to get them to do different things. Or maybe you don't, and you just play them against each other to see that occasionally one  might have a better strategy than the other. Let's say policy a wins 51  games and policy b wins 49. This is just due to random luck,  or maybe you perturbed policy a in some way that let it do this. Just to have a very simple model, let's pretend that for 50 of the  games, they played exactly equally. In the one game where a won,  it played slightly differently. It made one critical move that normally it  would have done differently, but due to some exploration or random noise, it happened to  make a smarter move than it did previously. So you have one true supervision signal for  your policy network, and then you have 99 games times 300 moves for which imitating those actions  gives you exactly the same policy you had before. The scale of your variance is actually very  bad, because you only have one label out of this enormous dataset of supervision actions where  you want… Actually, let me clarify a little bit. We're talking about how the good move, the  out-of-distribution move, is a small fraction of all the moves that are played across  all the games on which you'd want to train. This, of course, reminds me of how LLMs  are trained with policy gradient methods. When Karpathy was on the podcast, he called  it like "sucking supervision through a straw". 
It's interesting that this thing you're saying—which would be intractable and would prevent you from actually getting beyond a certain level in Go—is just by default how LLMs are trained? Right. This is not to say it doesn't work. If you imagine increasing the number of games to millions of samples, you actually can get some meaningful supervision samples, so long as you find a way to mask out the supervision from these guys. This is where things start to get pretty related to RL in terms of advantage and baselines. Let's look at the gradient variance of a very naive approach—I'm going to call it gradient RL—that's basically the sum of rewards. The sum of rewards is the return. In our naive setup here, we only have an indicator variable for the return, where either you won or lost. In the case where you lost, your gradient is zero, so you don't train on those examples, and when you won, you try to predict those things. You can think about this setup as a special case of this general formula here. The trouble here is that this is very high variance. When you multiply these terms out to compute the variance of the gradient, it's equal to the expectation of… Just for simplicity, we can pretend this is on average zero or something if you're centering it at no signal. The variance here basically means that you're taking the square of this product term. You end up with a term that grows quadratically with T. When you have a setup like this, this thing acts as a coupling effect on top of these terms here. Let's actually map this to an LLM case, and we can answer why LLMs only do one-step RL instead of a multi-step RL scenario. In LLMs, you have a decoder that might predict some words like "hello world". In current LLM RL, they treat this entire sequence as a single action, aₜ, and big T is just one. It's true that because of how transformers are formulated through the product of conditional probabilities, the log probability of the whole sequence is equal to the sum of the log probabilities of the individual tokens. In this case, I would say something like log P(hel) + log P(lo) + log P(world). This is true, and if this term were one, then they would be the same thing. However, when you're sampling, if you have a reward term assigned to every specific token, now you have these interaction effects between the cross-multiplication of these terms and these terms. The problem becomes, how do you ascribe the credit associated with every episode to all these different terms? The thing I'm confused on is what it would even look like to do it that way in LLMs, because you only get a reward at the end of the episode. You could imagine a reward that says, "I'm going to give you some process supervision where you get a reward for each of these actions on every step." Okay, so you're saying if instead of doing it that way… Well, the way you've written it, it would be a sum at the end anyway, so they wouldn't have to be multiplied. But you're saying instead of doing it that way, you would just add up these process rewards at the end and then treat that as one single reward signal? Correct, for one single log prob action. But isn't that how it's written to begin with anyways? The sum of the rewards? The thing that's a little bit hidden here in the math is that we're assuming that when you decompose the problem into a multi-step problem, you're now introducing correlations between your actions through the computation of this guy.
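Written out (notation mine, not the whiteboard's), the naive estimator being discussed and its baseline-corrected cousin are:

```latex
\nabla_\theta J \;\approx\; R \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
\qquad \text{vs.} \qquad
\nabla_\theta J \;\approx\; \sum_{t=1}^{T} \big(R - b(s_t)\big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

Squaring the first sum to compute its variance produces on the order of T² cross terms, which is the quadratic-in-T coupling being described; the baseline b(s_t) in the second form is what the advantage discussion that follows is about.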
If you separate these things out, this will magnify the variance of this one. In the case where you don't separate it out, if you just have T=1, you have a single estimate  of log prob and a single estimate of reward. This term still shows up. In LLMs, the naive REINFORCE  estimator looks a bit like the return of the single action… It looks kind of like this. This is the basic form here, but it's still a contributor to variance. You want to make sure that, similar to how in this case we were training  on a lot of neutral labels, you're sort of penalizing the labels that don't help and only  rewarding the ones that actually make you better. Intuitively, the analogy is: can we find  a term in our training objective such that it’s actually discouraged from doing this, or  these don't have any effect on the gradient, and this has an effect on the gradient? If you applied that there, the only thing you could do is eliminate 49 of the games. At least the way you have it written there, it would be 51 times… Actually, the optimal case is to discard all of these moves, and only get a gradient  on that single move that you got better on. How would you do that? This is a pretty tricky problem in practice. This is where advantage estimation  happens in reinforcement learning. You want to subtract a term from your multiplier. Instead of an indicator function of one and zero, you want something that behaves like a zero for  all of these guys, and then a one for these ones. You could do that if you can say, "Hey,  I won this game, so this is slightly above baseline performance." Well, you won a lot of games. But you don't know which ones let you win because  they were truly better versus winning by accident. How would you design a baseline  where it's truly better? This is where in RL people use things like  TD learning to better approximate the quality function, the Q that we mentioned earlier. You can try to subtract that from your return. Ideally, what you really want to do in RL  is push up the actions that make you better than average and push down the actions that  make you worse than average. They call this advantage. There are multiple ways to compute it. I highly recommend John Schulman's "Generalized Advantage Estimation" paper as a good treatment  on how to think about various ways to compute it. At the end of the day, you want to reduce  variance by trying to make this smaller, so it doesn't magnify the variance of this one. That makes sense. This requires you to have a very good estimate of what average  performance from a state would look like. This gets us back to the value function  thing we were talking about earlier. Keep in mind that this model-free RL setting  is trying to solve a credit assignment problem where you don't know which actions were  actually good and which ones were bad. Monte Carlo tree search is doing  something very fundamentally different. It's not trying to do credit assignment on wins. It's trying to improve the label  for any given action you took. We can actually think about a completely different  algorithm called neural fictitious self-play, which was used to great effect in  systems like AlphaStar and OpenAI's Dota. Let me talk a little bit about how you can unify  some of these RL ideas in the model-free setting as well as the self-play setting. What happens if you don't have the ability to easily search a tree? In Go, it's a perfectly observable game. You can easily construct a pretty deep tree  that completely captures the game state. 
In a game like StarCraft where you don't  have complete control over the binary, it's a little bit hard to do this, and I'm  not even sure if it's a deterministic game. That makes this difficult from  a data structures perspective. What is done instead is that the  basic idea of supervising your actions with a better teacher is still there. We’re going to talk a little bit about how neural fictitious self-play works. It's the same idea.  We're going to come up with better labels for each of the actions we took, just like in MCTS. But how do we derive the better labels? In MCTS, we perform search. Assuming we have a good  value function, the search will give us a better result than our initial guess. In a game where you can't easily simulate a search process, what they do instead is  train what's known as a best response policy. You fix your opponent. Let's say you're currently  training Πₐ against a strong opponent, Πb. In StarCraft, maybe these are the Zergs  and you're playing Protoss or something. You fix your opponent, and you treat this  as a classic model-free RL algorithm, where your goal is just to beat this guy. Here you use your standard TD-learning-style tricks, or use PPO or any  model-free RL algorithm to try to hill climb to winning against this player. You have a reward function where the return is one if it wins against Πb, and zero otherwise. This is no longer a self-play problem. It's just a fixed opponent, and you're  trying to maximize a score against them. You have a fixed environment where all  you care about is beating this guy. Once you have a good policy that you train  with your favorite model-free RL algorithm—PPO or SAC or V-MPO—you now have a good policy  that gives you a good label for what this one should do when playing against that player. When you train multiple best response policies, you can basically distill the RL algorithms  into the labels for a given opponent. You might have a best response policy  against πb, and then maybe you have a league of opponents like πb, πc, πd. You're going to take the best response policy that you train against each of  these fixed opponents, and supervise them with the label that this one would provide. This is almost like a proxy for your MCTS teacher. Instead of an MCTS teacher, you use a model-free  RL algorithm to find the best search action that you could do to beat your opponent. Then you're distilling the policy here into what's known as a mixed strategy,  where it's trying to average across all possible opponents you could play against. This is what gives you something that can do no worse than an averagely  selected opponent from the league. This gets around the problem of having to derive  a teaching signal from MCTS, but fundamentally it's still about relabeling your states with  better actions so that they improve your policy. Just to make sure I understand, if you  win a game against this other policy, you reinforce all the actions on that trajectory. Yes. Here you can use a number of algorithms like PPO, V-MPO, or even Q-learning if you want. The specific algorithm here is usually a model-free thing because you don't have search,  but there's an interesting connection between MCTS and Q-learning that I want to bring up. In MCTS, you have a tree, and through the resolution of your value function at  the approximate leaves of the tree, you can back up through many sequences and  obtain some sort of mean value estimate. Your Q is derived from the  average of a bunch of simulations. 
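A rough sketch of the best-response/league loop just described; train_best_response and distill are placeholders for any model-free RL algorithm (PPO, V-MPO, Q-learning) and for supervised imitation respectively, not the actual AlphaStar or OpenAI Five machinery.

```python
def neural_fictitious_self_play(league, num_iterations):
    """Alternate between best-response RL against frozen opponents and
    distilling those best responses into a single mixed-strategy policy."""
    current = league[-1]                      # newest policy in the league
    for _ in range(num_iterations):
        best_responses = []
        for opponent in league:
            # The opponent is frozen, so this is ordinary single-agent RL:
            # reward is 1 for beating this opponent and 0 otherwise.
            best_responses.append(train_best_response(current, opponent))
        # The best responses play the role that MCTS labels play in AlphaGo:
        # better actions for the states the current policy actually visits.
        current = distill(best_responses)
        league.append(current)
    return current
```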
In model-free algorithms, there's often a component of estimating a Q value. Q values are often learned through TD learning, although in PPO, the way they do advantage estimation is not necessarily through a Bellman backup. In Q-learning, there's this very cool trick where Q(s,a) is backed up as r plus a discount factor γ times the max over a′ of Q(s′, a′), the best Q value at your next state. Intuitively, if you have an MDP and this is terminal, it's saying that the best action you can take at this state is equal to the reward you get for taking this action, plus the best that you can do at the next state. There's a recursive, dynamic-programming property of MDPs, and you can train neural networks to basically enforce this consistency. You can say, "Well, once I know the Q value of this action, I can then use that to compute something about the Q value for the next." Earlier I was like, "Hey, why are we training policy? Why don't we just train the value alone?" That is what this is. This is an algorithm for recovering value estimates of intermediate steps when you don't have the ability to do forward search. You must first collect a trajectory of n steps before you're able to do this trick. But the intuition is kind of the same. Knowing something about the Q value here can tell you something about the Q value here. And indeed, you can recover a policy from a Q value. You don't need to explicitly model the policy distribution. You can recover the policy distribution by doing argmax over your Q values. Q-learning, or approximate dynamic programming, propagates what you know about the future Qs backward like this. You can see that there's a similar structure going on here, where in one case you're planning over trajectories your agent hasn't been to yet, whereas in the other you're planning over trajectories your agent has visited. Importantly, why was Q-learning a big deal? It's because historically we just haven't had the ability to do search on fairly high-dimensional problems like robotics or whatever. For a long time, we made the assumption that if we can't model the dynamics with a world model, we're going to instead just collect trajectories and then plan with respect to the only number that really matters, which is reward. This is very interesting. To unify this with our discussion of LLMs, you don't have Q values, but you're doing this backwards learning where you find the trajectories that pass some unit test in some coding environment and then reinforce those trajectories. There's a huge difference between that and this forward approach with MCTS. The reason MCTS is so much preferable is that you can do it per move and make each move better, rather than having to learn per trajectory and hope, as Karpathy said, to learn this… Through a straw. Right, you get supervision through a straw. You basically just upweight all the tokens in a trajectory that might or might not have been relevant to getting the answer right. The reason you can do this much more sample-efficient, favorable thing with Go is that, because MCTS works in Go, you basically know that if you just do the search locally here—and this search is truncated at the end by a value function that works even if you haven't unfolded your whole trajectory—you can just say, "This is my new policy," and improve in a more iterative, local way, rather than having to unfold all these trajectories. There was some research from Google in 2023 or 2024 where they tried to apply tree structures to reasoning.
The jury is still out as to  whether this can ever work. We probably will see a revisiting of this  idea of forward search in the future. But there are two things that make MCTS very  simple for Go. Value estimation is concrete. You can determine it for real, and then you can  use it to truncate depth, as you said. The breadth is also determined. What's critical is that the  action selection algorithm, where you iteratively visit and grow the tree, is well suited for  the size and depth of problem that Go is. But for something like LLM reasoning,  PUCT might not be a good enough heuristic. It might be too greedy with local tokens, and  it might only give you obvious thoughts that are correct but don't really solve your final problem. I would say the jury is probably still out on what the final instantiation of  reasoning for LLMs will look like. I wouldn't rule out that this stuff  could come back, but it's a bit hard. Don't LLMs natively learn to do MCTS, where  they'll try an approach and be like, "Oh, that doesn't work. Let's back up. Let's  try this other thing," and then go in the direction that proves to be more fruitful? Certainly, LLMs manage to do something that looks like real human reasoning without  having to do an explicit tree structure. That being said, I think the idea of doing forward  search and simulation to get a better sense of what is valuable might make a comeback, even if  not in exactly the same instantiation as AlphaGo. Just to make sure I understand the crux of it:  the breadth comes from the number of legal actions being wider, and the depth comes from not being  able to train a value function as easily, because… Here's an example where LLMs break down. The CPUCT rule involves the √N over 1+Nₐ. In an LLM, you're most likely never going  to sample the same child more than once. If you have multiple steps of thinking,  because language is so broad and open-ended, a discrete set of actions is not really  an appropriate choice for an LLM. Even though they're discrete tokens, it's  just such a large number that this type of exploration heuristic is probably not the  right thing to guide how to search down a tree. I guess the crux comes down to the fact that  in Go, you know that MCTS is almost certainly better than your current policy, even though  you haven't explored the end of any trajectory. In normal reasoning for LLMs or robotics,  there's no way to just locally evaluate and improve your next move in a way that's  independent of actually solving the problem. "No way" is a strong word. Lots of people have thought about how to apply MCTS or its successors  like MuZero to continuous control spaces, and I'm sure very cool research work is  still ongoing to try to crack that problem. But yes, the apparent challenge right now is  that most problems in much higher-dimensional action spaces, or something combinatorially much  bigger like language, don't seem as amenable to the discrete action-selection heuristics  and game-evaluation type stuff that Go has. That's not to say the idea of  thinking into the future along multiple parallel tracks might not give you  some information about which way to search. If you think about mathematics, it often occupies  more of a logical search procedure where you can back up and see which paths seem good or not. There's more of a rigid structure there, whereas in a business negotiation it's less  of a tree and maybe something a bit different. 
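For reference, the selection rule being alluded to is the PUCT formula used in AlphaGo Zero, where N(s,a) is the child visit count, P(s,a) the prior from the policy network, and c_puct the exploration constant:

```latex
a^{*} \;=\; \arg\max_{a} \left[ Q(s,a) \;+\; c_{\mathrm{puct}}\, P(s,a)\, \frac{\sqrt{\textstyle\sum_{b} N(s,b)}}{1 + N(s,a)} \right]
```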
We're now seated, so I can ask  you some more questions about AlphaGo and AI research more generally. In 2021, Andy Jones had a paper called "Scaling Scaling Laws with Board Games", and  he basically anticipated inference compute, or inference scaling, by showing that you can  trade off test-time compute and training compute. That is to say, you can spend more compute  searching through MCTS, and if you do that, you can get the equivalent performance to  having spent more time training the model. If you see this pattern, you might think that with  LLMs you might do something similar in the future, and in fact that's what ended up happening. What is a fun exploration one could do now to explore other axes of scaling in toy settings,  which will be important to understanding what AI development might be like in a few years? Test-time scaling and reasoning, and how they interact with model size, are quite profound when  it comes to how much needs to actually be done as explicit search versus how much can be packed  into the forward pass of a neural network. How does a forward pass of a neural network learn  to do something that should be sequential and recursive? That's quite interesting. The Andy  Jones "Scaling Scaling Laws with Board Games" paper is quite cool. There was another really nice result from that paper. Not only can you predict scaling laws of the LLM variety—where as you increase parameters  you can decrease the amount of compute for search or vice versa—but he also showed that you can  predict how much compute is needed to solve a larger version of the board game. For example with Go, which can scale from 3x3 to an infinitely sized Go board, you  might be able to revisit this question and try to reproduce whether this shows up. I actually started this project with the motivation of asking: does the "Bitter Lesson",  or our knowledge of scaling laws, allow us to execute a lot better on a compute-optimal Go bot? Can we build a strong Go bot without all of the KataGo tricks, just by really focusing  on the Bitter Lesson and scaling laws? I have not been successful so far, but usually  when you want scaling laws to work, you want to be in the regime where the recipe already  works and the data sets are good, rather than trying to figure out how to do scaling while also  trying to figure out what the right data sets are. The scientific understanding component  in research often follows a step where you get something to work first. Then you use that system to collect data that then helps you build a mental model  of how things work, such as scaling laws. Usually, if you want to build a strong Go bot  using scaling laws, you actually have to make a strong Go bot first and then use the scaling  laws to extrapolate a bit further into the future. Say more. Just so I understand, are  you saying scaling laws did not work? There was no scaling-laws pattern  that you could see in your Go bot? A mistake I made initially, when I had some  bugs around how MCTS labeling was working, was that I would collect a bunch of data  with an expert policy and then treat it as a supervised learning problem and try to  identify scaling laws with expert data sets. You can indeed plot things that look like  this, but if you're in a regime where your policy is not working well, you might  just be studying scaling laws on bad data. 
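For context, the curve-fitting step involved in a study like this is small; a minimal sketch with made-up numbers (not results from these runs or from the Andy Jones paper) might look like:

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up (compute, loss) pairs purely to illustrate the fitting step;
# compute is rescaled to units of 1e15 FLOPs to keep the fit well-conditioned.
compute = np.array([1.0, 10.0, 100.0, 1_000.0, 10_000.0])   # x 1e15 FLOPs
loss = np.array([2.10, 1.65, 1.32, 1.08, 0.91])

def power_law(c, a, b, irreducible):
    """loss = a * C^(-b) + irreducible, the usual scaling-law functional form."""
    return a * c ** (-b) + irreducible

(a, b, irreducible), _ = curve_fit(power_law, compute, loss, p0=[2.0, 0.1, 0.5])
print(f"fitted exponent: {b:.3f}, irreducible loss: {irreducible:.3f}")
```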
One important implementation detail is that  if you want to study a scaling-laws problem, you have to have a problem for which the data  is good, the architecture is good, and there are no bugs, and then you solve it from there. Ex ante, I wasn't able to apply scaling laws to direct what to look at until I had the rest  of the system working. This sounds obvious to researchers. Of course you want a working,  bug-free system before you study scaling. But as advice for practitioners, on where  I tripped up when I started this project, you don't necessarily want to jump into  the science of studying your man-made artifact before your man-made artifact  is interesting enough to be studied. Speaking of compute, you can look at these charts  of compute used to train the best AI model in the world over time going back 10 years. It's a very smooth line in log space that grows exponentially year over year. Except there's this huge aberration, and that aberration is AlphaGo Zero, which was trained  on way more compute than any other AI model at the time. It was 3E23 FLOPS. It's comparable to a  frontier LLM, orders of magnitude off, but still. The question is, especially with  you being able to get something off the ground, did you train it on your own? I got a donation from Prime Intellect for about $10K. I spent maybe the first $4K doing exploratory research, and then about $3K on the final run. Some of it remained for serving the model. Is your sense that they did a bad job  training it, if you can do it in $10K now? The compute required to be the first  to do something is always much larger than the compute it takes to catch up. It's the same story playing out in LLMs. Once someone else has done it, you  can use tricks like distillation. You can use all sorts of crutches  to bootstrap your way to success. With my own bot that I've hosted online,  I actually used best-response training against the KataGo models to get  a strong level of performance. As of the time of recording, I'm validating  whether I can do that first step, which is to do the tabula rasa play. Importantly for research, you often want to start from a good init. The simple thing I did first was train best-response agents against KataGo. The AlphaZero team did not have any policy they could train against, because they  were trying to do everything tabula rasa. Being the first to do it means you're prioritizing  getting the thing working rather than the most compute-efficient possible implementation. This plays out in robotics as well. If you look at the frontier of large models  trained for robotics, the scatter plot is all over the place, and there isn't a very  clean line the way there is for frontier LLMs. That's because the folks training  these models often are not at the scale where every FLOP counts and they need  to squeeze out the performance of every single FLOP as the deciding factor in pre-training. Instead, their focus is more, "We want a certain capability to show up, so we optimize the training  setup to make it easy to derive that capability." Once you have that capability… Invariably if you  scale up the compute, you are forced to make it compute-efficient, because this is hundreds  of millions of dollars we're talking about. But in the past, when compute for experiments was  more plentiful—or not accounted for in a way the researcher was really responsible for—you end up  with people optimizing for things besides being on the compute-optimal Pareto frontier. I see, like speed or something? 
Yeah, time to result, or just getting it to work. The first AlphaGo probably had lots of compute, and they didn't need to worry too much about  making it the most compute-optimal thing. How much of the improvements to compute  efficiency are methods that did not exist as of 2017 versus things which they  could have done in 2017 but didn't? Great question. Going into this project,  I knew in the back of my mind that things always get easier to do over time. I wanted to see where Go was at, given that there hadn't seemed to be any major  open-source strong bot after KataGo in 2020. Reading the KataGo paper, there  were a lot of clever ideas. I was thinking, "Let’s see if the Bitter Lesson  had happened, where a lot of these tricks just go away because Nvidia made faster GPUs. Roughly, where are we on that?" Again, this is not a peer-reviewed claim. It's just my preliminary vibe guess based on what I’ve seen with my own experiments. It seems like architecture choices don't matter that much. Transformer versus ResNet… We're at the speed of GPU where the size of the  model is not so big that this really matters. You can simplify the setup quite a lot. Instead of doing a distributed asynchronous RL setup with replay buffers and pushers and  collectors, you can do a dumb synchronous thing where you collect, train a supervised  learning model, and then collect again. There are opportunities to  simplify infrastructure. Nvidia GPUs have indeed gotten faster. Whereas KataGo was trained on V100s, you can train on half the number of  desktop Blackwell GPUs and it still works. Some of the auxiliary supervision objectives  that KataGo developed aren't really necessary if you have a strong initialization. If you're initializing best-response training against KataGo itself, your own model  needs none of the tricks that KataGo needs. The core thing is getting as quickly  as possible to some strong opponents. That matters a lot more than the  specific architectural innovations. But there are still some nice compute multipliers. I found that training on 9x9 boards was very  nice for resolving endgame value functions. If you can co-train that on an architecture  that can transfer between 9x9 and 19x19, you can really cut down the  warm-start time to learn from scratch. AlphaGo Zero's plot showed the first 30  hours or so spent basically catching up to the supervised learning baseline. You can cut down that time a lot by pre-training on a small board and then  warm-starting that into your 19x19 board play. There was some other stuff, like varying  the number of sims between episodes. This turns out to be not that sensitive. You can fix it or increase it. It doesn't matter too much. Anyway, it's nice from a  scientific perspective, revisiting an old paper and seeing what really matters. This is a tangential question, but why is it okay to have a buffer in AlphaGo? Every time I talk to an AI researcher, they're telling me about how bad it is to be off-policy. But the way a naive implementation of AlphaGo Zero would work is that most of the moves in a given  backward step, or in a batch of backward steps, would not be among the ones made by the most  recently trained model. So why is that okay? Great question. This gets into the  fundamental off-policy versus on-policy reinforcement learning questions. As you'll recall, in MCTS, you take actions that you took, and you relabel them  to take different actions on the same states. 
The off-policy part comes in where, what if  you're relabeling states that your new policy would never visit? What's the point? You're  wasting capacity. In the extreme limit, imagine the distribution of states in your training  buffer are all states you would never visit. Then you're supervising them to take good  actions on states you would never achieve. Therefore, your policy can get really bad. This is where off-policy can really hurt AlphaGo. However, if you interpret this from the DAgger  perspective—which is basically a way to correct yourself back to the optimal trajectory given  some data—what you want in an algorithm like this is to have mostly states that you would  visit, but then a small or reasonable percentage of states in this high-dimensional  tube around your optimal trajectories. Any of those states are given a supervision target  to funnel you back into your optimal trajectory. Maybe I can just draw real quickly here. In a DAgger-style setup, your optimal training data distribution is  your optimal states and actions. You want to be in this state, you want  to be in this state, you want to be in this state, and then you win here. These are your optimal policy actions. These are the things you  definitely want to train on. But to make it robust to disturbances, you  want to make sure that if you happen to drift off into some other states,  you can funnel yourself back in. But why isn't this a fully general  argument for off-policy training? This is actually why you want to  do off-policy training sometimes. You don't want a compounding error where, if  you make a mistake, you don't have the data for how to return to your optimal distribution. Optimal control doesn't really say too much about how to not accidentally get here, because it's  making the assumption that once you learn the policy, you're going to get to here. But in applications like robotics, a gust of wind blows you slightly  off and now you need to correct. Or the friction on one of your tires is slightly  lower than the other wheel, and now your car is drifting and you have to correct it. These kinds of things often happen in more real environments. There was a funny quote about chess and Go. The problem with Go and chess is that the  other player is always trying to do some shit. Things can drift off, and you always want to be  able to correct back to your winning condition. Your replay buffer really should have the  states your policy would visit, plus some distribution of states you might drift to, and  then how to return back to your optimal states. Now, if you take this to the extreme and  you say, "We don't have any of this data, and we're just going to be labeling with MCTS  states that are so far away from our optimal behavior, like this bag of states over here…"  Well, now each of them gets an MCTS label, and your policy learns how to take the best  possible action here, but you never get here. You're training your model on  states you would never reach. This is where off-policy can really hurt. As part of this project, I did try an experiment where I took a bunch of trajectories, and to  try to saturate the GPU as much as possible, I took random states from the dataset  and reran MCTS on just those states. Instead of playing a whole game where I'm doing  MCTS on every move, I ignore the causality of moves and pick random board states, and  I label those with my current network. I might revisit old states I've labeled before  and relabel them again with my current network. 
In practice, this actually does work. You can take some states that are reasonable and constantly be relabeling them while training. This starts to converge on a very robotics-like setup, which is very common. You have your dataset of trajectories, and then you have something like a replay buffer pusher.  These are off-policy offline trajectories. Your replay buffer pusher pushes transition tuples to  the replay buffer, and then you have a job that's continuously replanning what the best action you  should have done instead of the action you took. In robotics, it's very  common to minimize TD error. Your Bellman updater is constantly pulling  things from here and trying to satisfy Q(s,a). From here you have your trainer, which  is trying to fit the Q to the Qᵗᵃʳᵍᵉᵗ. You can think about this as a sort of planner. You visit old states you've been to, and you take your current model and rethink. What could I have done better if I visited this? This is how off-policy robotic  learning systems are usually trained. These days there's a simpler recipe, but in the  Google QT-Opt days, we did things like this. So what is the trainer? The trainer is where you try to minimize Q(s,a) and Qᵗᵃʳᵍᵉᵗ. Can you explain the whole setup again at a high level? You have your off-policy data that came from various policies. You're constantly pushing transitions you saw before to a replay buffer. Then you've got this thing called a Bellman updater, which basically replans:  instead of this action, what action should I have taken at s to have a better value? The way you enforce that is by trying to minimize the TD error. Given this, you  have s’. You compute Q(s’,a) and you find the action that should go with s’ that  makes this Q value as high as possible. Then you add that to the reward here,  and that gives you your actual target. For this current s and a, your Qᵗᵃʳᵍᵉᵗ is this. Now you send back the Qᵗᵃʳᵍᵉᵗ to this transition. With this tuple, you pair a Qᵗᵃʳᵍᵉᵗ. Then on the trainer side, you simply use supervised learning and minimize your  current network's Q(s, a) with its target. Got it. So in the background you're  basically thinking through how valuable all these actions actually were. Yeah. In a more optimal policy where you're trying to maximize this,  what is the Qᵗᵃʳᵍᵉᵗ of this transition? It's sort of like daydreaming. Exactly. You can think about it like you're going back in hindsight. Given what I've seen in the historical buffer, was there a better action I could have taken? The connection to Go here—I tried and it was moderately successful, but too complex to open  source—was replacing this with an MCTS relabeler. Instead of doing this target network  computation, you run MCTS on your transition. In this case, you have your state, your  action, and whether you won or not at the game. Actually, you can just toss those two. You don’t care about these ones. You just take your state and plan MCTS to get  your best policy, Π, on your current network. Not the network that took this action,  but your current best policy network. You rerun your search  offline on these transitions. If these are transitions your policy  can get to, this actually acts as a very nice stabilizing effect. Another benefit is that you can fully saturate your GPU better because you're not  blocking on the Go game to give you board states. You simply search across all board  states at any depth in parallel. Here the trainer would just predict  the MCTS label as best as possible. 
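A schematic of that relabeling loop; the queue plumbing and the helpers run_mcts, encode_board, and policy_cross_entropy are illustrative stand-ins, not code from QT-Opt or from this project's implementation.

```python
import random
from collections import deque

# Replay buffer of (encoded state, improved policy target) pairs.
replay_buffer = deque(maxlen=1_000_000)

def relabel_worker(net, stored_states, batch_size=64, num_sims=800):
    """Continuously re-plan previously visited states with the CURRENT network.

    This plays the role of the Bellman updater in a QT-Opt-style pipeline,
    except the improved label comes from running MCTS on the stored state
    rather than from a max-over-actions TD target.
    """
    while True:
        for state in random.sample(stored_states, batch_size):
            visit_counts = run_mcts(state, net, num_sims)    # hypothetical helper
            pi = visit_counts / visit_counts.sum()
            replay_buffer.append((encode_board(state), pi))  # hypothetical helper

def trainer_step(net, optimizer, batch_size=256):
    """Ordinary supervised step: fit the policy head to the MCTS labels."""
    batch = random.sample(replay_buffer, batch_size)
    loss = policy_cross_entropy(net, batch)                  # hypothetical helper
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```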
This kind of works, and it is quite relevant in robotics where you have a lot of offline data and you can't simulate things like MCTS. But in practice, it does run into the problem that if the current model is looking at states it would never reach, it's wasting capacity. You have to be a little bit careful here. Much of RL has converged to a much more on-policy setup where they don't really try to directly train on off-policy data. At best, they use off-policy data as a way to reduce variance, but not to directly influence the objective. Why have they converged to that? It's just more stable. You might use the off-policy Q as a way to do advantage computation, like Q minus the average of the Qs. If there are N actions, the average over them is your value estimate, and these are your current Q values. Your advantage for that action is your current Q minus that average value. People can try to estimate Q in an off-policy way and then just use advantage here. If there's a problem in these dynamics, it doesn't blow up your loss as much. In robotics, there's a convergence toward using off-policy data to shape your rewards but not to actually be directly in the objective here. I'm reminded of our earlier conversation about why MCTS is so favorable compared to the REINFORCE or policy gradient thing LLMs do. This might be totally wrong, but I wrote a blog post a few months ago about how policy gradient RL is even more inefficient than you might think. The inefficiency one thinks about naively is the fact that you have to roll out a whole trajectory to get any learning signal at all. As these trajectories become longer and longer—as an agent has to do two days' worth of work to figure out if it did a project correctly, rather than just completing the next word in a sentence—the amount of information per FLOP decreases. You have to unroll two days' worth of thinking to see if you implemented a feature correctly, so the amount of samples per FLOP has been decreasing. You're trying to maximize bits per FLOP as you're learning. You can think of bits per FLOP as samples per FLOP times bits per sample. What I mentioned a second ago is that the samples per FLOP go down as RL becomes more long-horizon. But this kind of naive RL is also terrible from a bits-per-sample perspective, at least compared to supervised learning. Early on in training, let's say you have a vocabulary size for an LLM that is 100K long, so there are 100K possible tokens that one could answer with. You have a totally untrained model and a prompt like, "The sky is…" With supervised learning, the model would have some probability distribution over all the things it could say. There's a label that says the term here is "blue," and it would learn through cross-entropy loss exactly how far its distribution is from correctly saying "blue." If you were doing this through RL, the model would try, "The sky is halcyon." Nope, that's wrong. "The sky is told." Nope, that's wrong. This is a totally untrained model. You would have to do this on the order of 100K times just to stumble on "blue" and get some learning signal. In the supervised learning regime, you have your distribution of probabilities, you get told that it's "blue," and you figure out how far off you were. The amount you learn is a function of your pass rate. The further away you are from "blue," the more you learn to go toward "blue" using cross-entropy loss. You can think of it as like your pass rate, your prior probability of having said "blue."
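As an aside, the two learning signals being contrasted here are easy to compute; a small illustrative snippet (the 100K vocabulary figure comes from the example above):

```python
import numpy as np

# Pass rates from 1/100,000 (an untrained model with a 100K vocab) up to ~1.
p = np.logspace(-5, 0, 200)
p = np.clip(p, None, 1 - 1e-9)      # avoid log(0) at exactly p = 1

# Supervised learning with a hard label: cross-entropy yields -log2(p) bits.
supervised_bits = -np.log2(p)

# Naive RL with a binary win/loss signal: at best the entropy of a coin with
# bias p, which peaks at 1 bit at p = 0.5 and collapses as p -> 0.
rl_bits = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
```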
As a function of that, in supervised learning, through cross-entropy loss, you would learn negative log(p) bits, p being the pass rate, once you get this label. Whereas in RL, if you're just randomly guessing and seeing if it works, that's basically just the entropy of a binary random variable. What's also tough here is that the distribution you're sampling under is your policy's distribution. If your policy has no chance of sampling "blue," then you will never get a signal. Exactly. That's modeled by the fact that your probability of sampling "blue" is extremely low. If you sample it, you do learn as much as you would have in supervised learning. In all other cases—99.99% of the time in an untrained model—you're learning incredibly little from seeing that "halcyon" or "told" is not the correct word. That's what happens most of the time. You learn very little. If you put your pass rate on the X-axis and the bits you're learning from a sample on the Y-axis, with 0%, 50%, and 100% marked, then at the end of training you're here. If you have supervised learning, the negative log pass rate would look something like this. The entropy of a binary random variable would look like this, depending on whether you're doing nats or bits. If you do bits it's one here at the peak. This is like a coin flip. You learn the most from a coin flip. This is supervised learning. This is RL. However, the problem is you spend most of training in this regime, in the low pass rate regime. How fast you're learning is a function of how many bits per sample you're getting, and you're getting very little signal here. If you chart the pass rate on a log scale—where at the beginning of training with a vocab size of 100K the pass rate is 1/100,000, then 1/10,000, 1/1,000, 1/100—what this graph looks like here, supervised learning would look like this, and then RL would look like that. And arguably you spend all your time here, potentially never even getting a single success. It's a depressing plot in the sense that once you're here, it's not at all obvious how you get to there. Once you're here you have something, but you actually spend all the time here in many RL problems. There's a question of how you initialize so you're at least at a non-zero pass rate. One more thing I'd like to add about bits per sample that's very relevant to any machine learning problem is that there's a connection to soft targets and distillation. If you have access to the logits—the soft targets—rather than just the one-hot token answer, the entropy of that distribution is far higher than the one-hot. There's way more information in bits per sample in a soft label. That's why distillation is so effective per sample. It's giving you way more information per sample. I wonder what the equation would be, but obviously it's… It would just be the entropy of this distribution. The entropy of this is zero. The entropy of this is the entropy equation. This is also why AlphaGo is quite beautiful. In AlphaGo, you don't train the policy network to imitate the MCTS action. You train it to imitate the MCTS distribution. Both of these are valid, and if you wanted to do a scientific experiment on how important this soft label—this dark knowledge distillation—is, you can run an experiment where you retrain the policy network on the action MCTS selected rather than the soft target. Earlier I was stumbling around.
Intuitively, why is this ability to do iterative search—where  you don't necessarily need to be able to win the game in the beginning, you just need  to be able to improve your current policy—so powerful a capability in learning  compared to how LLMs currently learn RL? It's exactly this thing of considering  your pass rate of the entire trajectory. I just don't know a formal  way to think about this. Maybe you should help me out here. Why is AlphaGo an elegant RL algorithm? The major reason is that you never have  to initialize at a zero percent success rate and solve the exploration problem  of how to get to a non-zero success rate. This is what allows you to hill-climb  this beautiful supervised learning signal. If you look at the actual implementation  of AlphaGo, every step of the way, there's actually no TD error learning or  dynamic programming, at least explicitly. It's just supervised learning  on a value classification as well as a policy KL minimization. It's just a supervised learning problem on improved labels. The training is very stable.  You can train as big of a network as you want. You can retrain this on the data set. Everything  will just go stably. The infrastructure is very simple to implement as well. You don't need a complex distributed system to keep everything on policy. At the end of the day, you're just saying, "I have some improved labels. Let's retrain my supervised model on these targets." You're always in this beautiful regime where you're just trying to  improve the policy, rather than escape local minima where every signal is flat all around you. One way to draw the curve is if you draw the win rate of an MCTS policy versus the raw network…  Let's say this dotted line is the raw network. The MCTS policy looks like this. Every step of the way, this supervision signal is very clean. You're never in a situation where the MCTS is giving you no signal, unless your  MCTS distribution converges to exactly what your policy network predicts. That's a great way to explain it. Maybe we sit down and I ask some  questions about automated research. One thing I really wanted to talk  to you about is that you did a bunch of the research for this project through  this automated LLM coding assistant loop. There's an idea that if you fully automated AI  research, you could have some sort of singularity. Obviously we're not there yet, but to the  extent that we have early indications of what this process might look like, I'm curious about  your observations about what the AI is good at, what it's not good at, what you think about  this scenario's likelihood eventually, and what thoughts you have about this in general. I think automated scientific research is one of the most exciting skills that the  frontier labs are developing right now. It's important for everyone who's doing any kind  of research to get a good intuition of what it can do now and what it can't, and how the science  process might work in the future once we have AIs automating a lot of this investigation. In brief, I mostly used Opus 4.6 and 4.7 while working on this. What works is that the models can do a very good job of hyperparameter optimization. In the past, people would come up with a search space of hyperparameters like  learning rate, weight decay, and maybe how many layers are in your network. They would do a grid search or a Bayesian hyperparameter optimization approach,  and it would find some tuned parameters. 
Maybe we sit down and I'll ask some questions about automated research. One thing I really wanted to talk to you about is that you did a bunch of the research for this project through an automated LLM coding-assistant loop. There's an idea that if you fully automated AI research, you could have some sort of singularity. Obviously we're not there yet, but to the extent that we have early indications of what this process might look like, I'm curious what you've observed about what the AI is good at, what it's not good at, how likely you think this scenario eventually is, and what thoughts you have about this in general.

I think automated scientific research is one of the most exciting skills that the frontier labs are developing right now. It's important for everyone who's doing any kind of research to get a good intuition for what it can do now and what it can't, and for how the scientific process might work in the future once we have AIs automating a lot of this investigation. In brief, I mostly used Opus 4.6 and 4.7 while working on this.

What works is that the models can do a very good job of hyperparameter optimization. In the past, people would come up with a search space of hyperparameters like learning rate, weight decay, and maybe how many layers are in your network. They would do a grid search or a Bayesian hyperparameter optimization approach, and it would find some tuned parameters.

The really cool thing that automated coding can do now is search a much more open-ended set of problems. It can say, "I've identified that the gradients are small in this layer, so let me change it up here. Let me rewrite the code so the data loader has a new augmentation I came up with. Let's try to find the best way to fit the constraints of the optimization problem." You end up with this much more flexible, high-level, almost grad-student-like ability to just grind a performance metric. This can squeeze out quite a lot of performance: on a fixed data set with a fixed time budget, you can improve perplexity by quite a lot on a classification problem like LLMs or Go.

It is also fantastic now at executing basically any experiment. I have a Claude Skill that I wrote called Experiment, where I give it a description of what I want it to plot. I just describe, "Here's the x-axis I want, here's the y-axis. Answer this question for me." It will run off and do all the experiments, compile the plot, make a report, and suggest what might have caused the result, and so forth. That's what works quite well today, and I think we can expect these abilities to get better in the future.

But it's also useful to know what it's not doing so well today. In the blog version of this tutorial, I have a plot of all the experiments I did, grouped in a tree where every node represents a failed, successful, or mixed experimental result. From there, it branches off into a child representing the follow-on experiment. Occasionally I'll rabbit-hole down a track like off-policy MCTS relabeling, do a few experiments, and then realize it's probably not worth it, so I'll jump to a completely different track. I call these things rows. What I find is that the current closed models the public can access today don't seem to be that great at selecting what the next experiment should be in a given track. They don't seem to be able to step back and do the lateral thinking of, "Wait a minute, this track doesn't really make sense. Let's go back to first principles and think about what the bottleneck might be, or what we're actually trying to achieve." Often I had to catch infra bugs myself by prompting Claude with the right question to investigate what was causing the discrepancy, and then it would answer the question.

With Mythos-class models or Mythos++ models coming online, maybe this completely changes and these problems fall to improved scaling. But at the same time, I think there's a rich opportunity to develop RL environments that might incentivize this kind of lateral thinking. One of the motivations for setting up this Go environment was that Go captures a lot of very interesting research problems, often overlapping with LLMs or robotics, yet it's very quick to verify. The outer loop is ultimately: does the agent do what I think it does? You can check the outcome of a Go game quite easily. The inner loop involves all this research engineering around distributed systems, predicting whether an idea is going to work or not, and predicting the difference a particular modification to your training algorithm might make. I think there's a rich library of subtasks and sub-environments that you can train an automated scientist on, with Go as a sort of outer verification loop. Once you acquire these skills, maybe you can apply them to other domains like biosciences or robotics.
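That outer verification loop can be made very concrete: it is essentially a win-rate estimate over a batch of games against a fixed reference opponent. A minimal sketch, where play_game is a hypothetical callable standing in for a full game harness with Tromp-Taylor scoring:

```python
import math

def evaluate_agent(play_game, num_games=200):
    """Outer-loop check: estimate the agent's win rate against a fixed reference bot.

    play_game is a hypothetical callable that plays one full game (the flag
    alternates which side our agent takes) and returns True if the agent wins;
    Tromp-Taylor scoring makes that outcome unambiguous to compute.
    """
    wins = sum(bool(play_game(agent_plays_black=(i % 2 == 0))) for i in range(num_games))
    win_rate = wins / num_games
    # Rough 95% confidence interval via the normal approximation.
    half_width = 1.96 * math.sqrt(win_rate * (1 - win_rate) / num_games)
    return win_rate, (max(0.0, win_rate - half_width), min(1.0, win_rate + half_width))
```

Alternating colors removes the first-move advantage from the estimate, and the confidence interval is a quick sanity check on how many games the outer loop needs before a comparison is meaningful.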
Or automating AI research.

Or automating AI research. Which is the real crux, the scary/incredible thing of making AIs build future versions of AIs.

You're suggesting the outer loop here could just be your win rate against KataGo?

That's one of them. I think there are a lot of deeper questions one could tackle. For example, let's say you have an idea for how to improve a scaling-law compute multiplier. The outcome isn't necessarily "achieve the best Go bot ever." The outcome might just be, "Can I predict what the win rate of my Go bot will be?" or, "Can I predict the scaling-law plots that emerge from my idea?" But then you can verify that you haven't reward-hacked anything by using a very verifiable game like Go in the outer loop.

I think there are a couple of interesting follow-on questions, on the inner loop and on the outer loop. On the inner loop, there's a question of how locally verifiable any modification you might make is. That is to say, would you know whether some idea you try out is actually an improvement or a degradation? Would you know whether something isn't working because of a bug, or because the idea itself is wrong?

Ilya was talking about how one of the things he thinks makes him a good researcher is that he has a strong belief in what the correct idea is. He's able to persevere through bugs and tell which things are bugs versus mistakes in the fundamental idea, based on his high-level belief that "this idea should work, so therefore there has to be a bug," versus the other way around.

Why don't we start with that question: how locally verifiable are good ideas?

Take the success story of deep learning: you can think of it as a decades-long idea that took a lot of faith to get to work. That presents a very challenging long-horizon RL problem, where every step of the way you have a committee telling you that this is a bad idea, and then ultimately you break through. How do you design RL environments that maybe give you some feedback earlier? I think this is a very tough open question that I don't have an answer to.

Ultimately, to build a very strong Go bot, you probably did need to discover deep learning. So a challenging game that cannot be cheated easily could be used as an outer-loop signal for something like discovering the principles of deep learning. Now, of course, to make it tractable—and this is where research taste really matters—you have to come up with ways to initialize your problem so that you don't try to solve a very intractable problem. Maybe you can leverage LLMs as a universal grammar in the middle to give you some sort of local feedback. The fact that LLMs are a universal grammar means that they can move at almost any level of the stack: they can think very locally, as well as step back and think in very broad steps. I think that's where a lot of the lateral-thinking ability of humans comes from: knowing when the track you're pursuing, or the objective you're pursuing, is not right, and you should be asking a different question.

The other question is how stackable local improvements are in the attempt to get to a better result on the outer loop. I've heard rumors that at some AI labs, the thing that has gone wrong is that people will individually pursue good ideas, but those ideas don't end up stacking well, and so the training run fails because of some weird interaction between two seemingly good ideas. Having a single top-down vision of how things should work is very important.
Having worked at different AI labs and also played around with parallel agents trying different ideas, what's your sense of how parallelizable AI innovation is?

Great question. I think the research taste for executing well on the Bitter Lesson is that you need to know how much the Bitter Lesson can buy you, and how much is too much to ask for, at any given moment. Of course, in the fullness of time, compute is the single most important determinant of how things work. It's almost inevitable that as you scale up energy and compute and parameters, intelligence will just fall out of that. That's super beautiful, super profound. No algorithmic detail really matters beyond that. But in the present day, we don't have infinite compute and parameters and arbitrarily good initialization, so we have to come up with heuristics that make up the difference.

These heuristics are probably somewhat redundant. That's probably why you see this effect where a lot of these compute multipliers don't necessarily stack. They might have some correlated benefit. And then three years down the line, when the Nvidia GPUs have gotten even stronger, maybe they stack even less well. Maybe at any given point in time the benefit of any given compute multiplier is transitory, which is what I suspected with the KataGo paper. There were many algorithmic ideas applied, and you can see that with modern Blackwell and Ada-class GPUs—which are much better than the V100-grade GPUs that paper used—some of these algorithmic tricks to speed up convergence just don't matter so much compared to something else. I think that's a matter of taste in the present time.

How about the outer loop? How verifiable is it when the goal is making AI smarter?

With Go, you do have this outer loop of win rate against the best open-source model out there. Even there, as you were saying, there are other outer loops, like whether you discovered a new phenomenon, which is actually very hard to…

If you didn't know scaling laws were important… When were the Chinchilla and Kaplan scaling-law papers released?

The Kaplan paper was 2020, and Chinchilla was 2022. So if you're back in 2015, there's not an automated procedure one can easily imagine for knowing which paper is the scaling-laws paper versus which is just another random plot.

Even in the Go case, it's a hard-to-verify outer loop, and the whole idea of an outer loop is to have some backstop on improvement, let alone for general AGI, where of course we have a bunch of these benchmarks. But there's a problem. We know the things we can measure, and we improve on the things we can measure. We care about this broader ability to do economically useful work, which is not super easy to measure, at least until you automate everything. So there's a question of how good the outer verification loop is for AI self-improvement, and whether that matters.

I'm going to give a non-rigorous argument, but one that I intuitively believe. DeepMind started with a focus on games. They used games as their outer loop, their researchers learned from the experience of solving games, and now they're working on LLMs. Presumably there was some positive transfer from their time working on games and Atari and Go and StarCraft that now helps them make good LLMs. I assume that there's positive transfer in some regard, whether it's coding or general research ability or project management. All these things probably help them do well. If that's the case, why wouldn't it also be true for automated AI researchers?
They should be able to positively transfer experience tackling quick-to-verify, quick-to-iterate environments to something more ambitious and economically useful, like automating drug discovery.

I don't know. Isn't the issue… Historically, until Gemini 3 or whatever a couple years ago, people were saying, "Look, Google isn't catching up in LLMs because they're too tied to the old approach." There are gains, but there are also ways in which it actively hinders you. It's not obvious to me.

The jury's still out. Let's say currently Google's doing quite well. Who knows if the initialization of training on games is ultimately going to hobble their ability to be the winner in the long term. It's hard to say for sure. Likewise, who knows if the seeming late start was really just them pre-training for longer on how to scale up TPUs. They invested all their tech tree in getting TPUs to be good, which seemed not that useful in the short term but in the long term it becomes… So it's even hard for humans to reason about what the optimal research strategy should be, even with the data we have today.

Okay, we should let people know how they can find out more about this project, whether to fork it themselves, or check out your blog post where you do an excellent job explaining many of these ideas. Where do people go next?

My website is evjang.com. There's a blog post that links to an interactive version of this tutorial. On my GitHub, which is the username ericjang, there's an AutoGo repo that people can fork and reproduce the training results. I also highly recommend people check out this blog post, "As Rocks May Think." We touched on some of the ideas in this conversation, but it's this grander thesis of what happens when you have thinking as a primitive in—

Computer science.

Exactly. I highly recommend people check out that blog post as well. I encourage the audience to think about the relationship between thinking and Go via MCTS and search, and how it relates to LLMs. I think there's something quite profound there, and probably underexplored just because Go has been relatively underexplored compared to the boom in LLMs. It's not to say that I think we should have trees in our LLMs, but there is some very interesting duality between them. You can actually do a lot of research on Go, MCTS, and reasoning with very small budgets. So that's very exciting.

Awesome, Eric. Thanks for doing this.

Thank you. It's an honor to be on the podcast.