[@DwarkeshPatel] Building AlphaGo from scratch – Eric Jang

· 10 min read

Link: https://youtu.be/X_ZVSPcZhtw

Duration: 157 min

Short Summary

Eric Jang, VP of AI at 1X Technologies and former Google DeepMind Robotics researcher, explains AlphaGo's architecture and Monte Carlo Tree Search: how policy and value networks paired with PUCT selection tame Go's intractably large game tree, and how the system was trained. The episode explores compute efficiency gains (achieving AlphaGo-level results for ~$10K versus millions) and why MCTS struggles with open-ended LLM reasoning. Jang also discusses scaling laws, automated AI research using game environments as verification loops, and insights on research philosophy and lateral thinking.

Key Quotes

  1. "This is something on the order of 361³⁰⁰, which is far more than the number of atoms in the universe." (00:14:08)
  2. "This is what makes Go a beautiful game: you can lose the battle but win the war." (00:07:30)
  3. "AlphaGo was the first paper that really showed this profound level of simulation being compressed into a small amount of compute." (00:20:18)
  4. "The major reason is that you never have to initialize at a zero percent success rate and solve the exploration problem of how to get to a non-zero success rate." (00:31:18)

Detailed Summary

AlphaGo, Monte Carlo Tree Search, and the Future of Automated AI Research

Guest Background and Motivation

Eric Jang is VP of AI at 1X Technologies and previously served as senior research scientist at Google DeepMind Robotics. He was profoundly inspired by early AlphaGo breakthroughs in 2014-2016, witnessing how AI could tackle computational complexity classes that computer scientists had long understood to be intractable for exhaustive search.

  • Jang's research trajectory was shaped by watching AlphaGo demonstrate that neural networks could solve problems previously considered impossible to solve this century
  • His work at DeepMind Robotics involved transferring insights from game-based AI to physical robotics applications
  • At 1X Technologies, he applies these principles to developing humanoid robots for real-world deployment

The Intractable Nature of Go

A 19x19 Go board admits on the order of 361^300 possible game trajectories, a number that far exceeds the estimated number of atoms in the observable universe (a quick back-of-the-envelope check follows the list below). A player has up to 361 legal moves at any given position, and games typically run 250-300 moves.

  • Before AlphaGo, computer scientists widely believed that Go would not be tractable for traditional search methods before 2100
  • The game tree's breadth (legal moves per position) and depth (total moves per game) both contribute to computational intractability
  • Earlier chess programs like Deep Blue relied on handcrafted evaluation functions and brute-force search that would fail catastrophically on Go's much larger branching factor
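
To make that scale concrete, here is a quick back-of-the-envelope check of the 361^300 figure against the commonly cited ~10^80 atoms in the observable universe (treating 361 moves per position and 300 moves per game, the approximate figures above, as rough constants):

```python
import math

# Rough game-tree size: ~361 legal moves per position, ~300 moves per game
# (the approximate figures quoted in the episode).
branching, depth = 361, 300
exponent = depth * math.log10(branching)  # log10(361**300) ~= 767

print(f"361^300 is roughly 10^{exponent:.0f}")
print("Estimated atoms in the observable universe: ~10^80")
```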

AlphaGo's Core Architectural Breakthrough

AlphaGo solved the Go search problem by using neural networks to address both the breadth and depth challenges of the game tree simultaneously. The system combined deep learning with search to achieve superhuman performance.

  • The policy network learns to predict likely moves, reducing the effective branching factor from 361 to a manageable set of promising candidates
  • The value network learns to evaluate board positions, reducing the need to search all the way to terminal game states
  • This dual-network approach made an otherwise intractable search problem tractable in practice on modern hardware (a rough architecture sketch follows this list)
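
As a rough illustration of the two heads described above, here is a minimal dual-head network sketch in PyTorch. This is not the actual AlphaGo architecture (which used much deeper convolutional/residual trunks and many input feature planes); the layer sizes, the single input plane, and the shared trunk are assumptions made only to keep the example short.

```python
import torch
import torch.nn as nn

class DualHeadNet(nn.Module):
    """Toy policy/value network: shared trunk, two heads (illustrative only)."""

    def __init__(self, channels: int = 32, board: int = 19):
        super().__init__()
        # Shared convolutional trunk over board feature planes.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        flat = channels * board * board
        # Policy head: one logit per intersection (361 for a 19x19 board).
        self.policy_head = nn.Linear(flat, board * board)
        # Value head: a single scalar in (-1, 1) estimating win/lose.
        self.value_head = nn.Linear(flat, 1)

    def forward(self, x: torch.Tensor):
        h = self.trunk(x).flatten(1)
        policy_logits = self.policy_head(h)     # prunes breadth: ranks candidate moves
        value = torch.tanh(self.value_head(h))  # cuts depth: scores the position without a rollout
        return policy_logits, value

# Example: a single dummy 19x19 position with one input plane.
logits, v = DualHeadNet()(torch.zeros(1, 1, 19, 19))
print(logits.shape, v.shape)  # torch.Size([1, 361]) torch.Size([1, 1])
```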

Monte Carlo Tree Search Architecture

MCTS performs four iterative steps in sequence: selection of the best action via the PUCT criterion, expansion of the tree by adding child nodes, evaluation of board positions using the value network Vθ, and backup of the resulting values along the visited path back toward the root node (a minimal code sketch of one simulation follows the list below).

  • The PUCT selection formula is Q(s,a) + CPUCT × Pₐ × (√N / (1 + Nₐ)), where N is the total visit count of the parent node and Nₐ, the visit count of action a, starts at zero for all actions
  • When Nₐ equals zero, the exploration term dominates, initially biasing selection toward the highest-probability policy node predicted by the prior network
  • AlphaGo uses two neural network heads: a value network performing binary win/lose classification and a policy network outputting probabilities for all 361 board intersections
  • Visit counts accumulated during MCTS become the final policy distribution π, which becomes more "peaky" (confident) as simulation count increases
  • The MCTS tree is reinstantiated from scratch after each opponent move with a new root node and fresh search
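
The following is a minimal, self-contained sketch of one MCTS simulation walking through the four phases above. The game rules and the network are stubbed out (`legal_moves`, `apply_move`, and `net` are placeholders), and the exploration constant, helper names, and backup sign convention are assumptions for illustration rather than AlphaGo's actual implementation:

```python
import math

C_PUCT = 1.5  # exploration constant; values around 1-2 are typical

class Node:
    def __init__(self, prior):
        self.prior = prior      # Pa from the policy network
        self.visits = 0         # Na
        self.value_sum = 0.0    # running sum used to compute Q(s,a)
        self.children = {}      # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(parent, child):
    # Q(s,a) + CPUCT * Pa * sqrt(N) / (1 + Na)
    return child.q() + C_PUCT * child.prior * math.sqrt(parent.visits) / (1 + child.visits)

def simulate(root, state, net, legal_moves, apply_move):
    """One simulation: select -> expand -> evaluate -> backup."""
    node, path = root, [root]
    # 1. Selection: walk down the tree picking the highest-PUCT child.
    while node.children:
        parent = node
        move, node = max(parent.children.items(),
                         key=lambda kv: puct_score(parent, kv[1]))
        state = apply_move(state, move)
        path.append(node)
    # 2. Expansion + 3. Evaluation: query the network once at the leaf.
    priors, value = net(state)  # priors: {move: probability}, value: scalar in (-1, 1)
    for move in legal_moves(state):
        node.children[move] = Node(prior=priors.get(move, 0.0))
    # 4. Backup: propagate the leaf value along the visited path toward the root,
    # flipping sign at each step (a simplified two-player perspective convention).
    for n in reversed(path):
        n.visits += 1
        n.value_sum += value
        value = -value
```

After many such simulations, the visit counts of the root's children give the increasingly "peaky" search policy π described above.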

AlphaGo Lee vs AlphaZero Training Approaches

AlphaGo Lee initialized its network with supervised learning on expert human play datasets before transitioning to self-play refinement, while AlphaZero learned entirely from scratch through self-play without any human game data.

  • AlphaGo's training relies on supervised learning using policy KL minimization and value classification on MCTS-generated labels, not TD-error learning or dynamic programming (a sketch of this loss appears after the list below)
  • This approach is stable and scalable—networks can be made arbitrarily large without training instability
  • AlphaZero spent roughly the first 30 hours of training catching up to what the supervised learning baseline would have achieved before self-play refinement could make progress
  • Playing 50,000 random games on a 9x9 board is sufficient to learn a good value function due to common patterns visible at that board size, with good transfer learning to the full 19x19 board
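
As a sketch of the loss described in the first bullet, here is an AlphaGo Zero-style objective in PyTorch: cross-entropy (equivalently, KL divergence up to a constant) between the network's move distribution and the MCTS visit-count distribution π, plus a value term against the game outcome z. The function name and the mean-squared value term are assumptions for illustration; the exact form used in the episode's code may differ.

```python
import torch
import torch.nn.functional as F

def alphazero_style_loss(policy_logits, value_pred, mcts_pi, outcome_z):
    """policy_logits: (B, 361) raw move logits from the policy head
       value_pred:    (B,) predicted outcome in (-1, 1) from the value head
       mcts_pi:       (B, 361) visit-count distribution produced by MCTS (the search "label")
       outcome_z:     (B,) final game result, +1 for a win and -1 for a loss"""
    log_probs = F.log_softmax(policy_logits, dim=-1)
    policy_loss = -(mcts_pi * log_probs).sum(dim=-1).mean()  # cross-entropy against pi
    value_loss = F.mse_loss(value_pred, outcome_z)           # regress toward the outcome z
    return policy_loss + value_loss
```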

The Credit Assignment Problem in Naive Self-Play RL

In naive self-play reinforcement learning with two evenly-matched policies playing 100 games of 300 moves, only 1 game out of 100 contains a meaningful supervision signal, yielding approximately 1 useful label out of ~30,000 total moves.

  • Karpathy compared policy gradient training in LLMs to "sucking supervision through a straw" due to sparse, high-variance reward signals
  • MCTS fundamentally avoids the credit assignment problem through exhaustive search-based supervision, providing low-variance supervision targets per action
  • The variance of gradient estimates in naive RL grows quadratically with the number of time steps T due to coupling effects between action-reward products
  • This quadratic variance growth makes naive RL extremely sample-inefficient for long-horizon tasks (a short sketch of the argument follows this list)
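
One standard way to see the quadratic growth, sketched here as a generic REINFORCE-style argument rather than a quote from the episode: with a single terminal reward R, the policy-gradient estimator sums T log-probability terms that all share that same R, so its variance picks up on the order of T² covariance terms.

```latex
g = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R,
\qquad
\operatorname{Var}(g) = \sum_{t=1}^{T} \operatorname{Var}(\psi_t)
  + \sum_{t \neq t'} \operatorname{Cov}(\psi_t, \psi_{t'}),
\quad \psi_t = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R .
```

The second sum contains T(T-1) terms, and because every ψₜ is multiplied by the same return R those covariances do not generally vanish, so Var(g) can grow roughly as T². MCTS sidesteps this by supplying a low-variance target for every individual move.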

Compute Efficiency: From Millions to Thousands

KataGo, an open-source Go bot project by David Wu from Jane Street released in 2020, achieved a 40x reduction in compute needed to train a strong Go bot tabula rasa compared to AlphaGo Zero.

  • A researcher received approximately $10K from Prime Intellect, spending roughly $4K on exploratory research and $3K on a final training run, achieving AlphaGo-level results for a few thousand dollars versus the original estimated millions
  • Andy Jones's 2021 paper "Scaling Scaling Laws with Board Games" demonstrated that test-time compute can trade off with training compute
  • AlphaGo Zero trained on approximately 3×10²³ FLOPs (floating-point operations), comparable in compute to frontier LLMs of that era
  • The KataGo paper used V100-grade GPUs, but modern Blackwell and Ada GPUs are substantially stronger, reducing the relevance of some algorithmic convergence tricks discovered on older hardware

Why MCTS Struggles with Open-Ended Language Tasks

MCTS works exceptionally well for Go because value estimation is concrete (win or lose) and the PUCT exploration strategy is well-suited to the game's finite problem size. However, these assumptions break down for language tasks.

  • In LLMs, the PUCT rule's √N/(1+Nₐ) exploration heuristic fails because language is too broad and open-ended compared to Go's discrete 361-move action space
  • LLMs almost never sample the same child node more than once during tree search, violating the assumptions underlying PUCT's exploration bonus
  • The small, discrete action sets that make MCTS effective for board games have no good analogue in the effectively unbounded branching of open-ended text generation
  • Google had active research in 2023-2024 applying tree structures to LLM reasoning with inconclusive results
  • The fundamental challenge is that language tasks lack the clear terminal states and compact state representations that enable effective value estimation in board games

Neural Networks at the Edge of Chaos

Neural networks achieve maximum representational power at the "edge of chaos"—a critical regime between ordered and chaotic behavior where information propagation through layers is optimized.

  • A 10-layer neural network can amortize and approximate an intractable search problem with high fidelity, which Jang considers one of the most profound breakthroughs in deep learning
  • Initialization is key for successful training—researchers should always start projects as close to success as possible with well-chosen initial conditions
  • Cryptographic protocols and neural networks exhibit convergent evolution—both use sequential layers to jumble information and aim for maximum sensitivity to initial conditions
  • Mid-game scoring is the hardest part for value functions to learn; beginning positions converge toward 0.5 (random win probability) and end positions are obvious, making the middle game the critical skill that separates strong from weak players

Automated Research and Game-Based Verification Loops

Go serves as an ideal outer verification loop for AI research because game outcomes are easily and quickly verifiable—you immediately know who won a game. The inner loop involves distributed systems engineering and predicting whether proposed modifications will improve results.

  • DeepMind started with games (Atari, Go, StarCraft) as their outer loop for learning, then transferred that experience to working on LLMs
  • Jang used Opus 4.6 and 4.7 for automated research on the project, demonstrating the feasibility of AI-assisted AI development
  • Current closed AI models do not excel at selecting the next experiment in a given research track or performing lateral thinking to recognize when a track should be abandoned
  • Automated AI researchers could potentially transfer experience from quick-to-verify game environments to economically useful tasks like automating drug discovery
  • The challenge is that economically valuable tasks typically have slower verification loops than board games

Scaling Laws and the Centrality of Compute

The Kaplan (2020) and Chinchilla (2022) scaling-law papers established compute as the single most important determinant of AI capability. Intelligence emerges from scaling energy, compute, and parameters.

  • Compute multipliers and algorithmic tricks (such as those in the KataGo paper) do not stack well together, likely due to correlated benefits that overlap when combined
  • LLMs function as a universal grammar capable of operating at multiple levels of abstraction—from very local token prediction to very broad strategic thinking
  • The relationship between training compute and capability follows predictable power laws that let researchers estimate the resources required for a target performance level (a toy calculation follows this list)
  • Frontier AI development continues to require massive compute investments, making efficiency improvements increasingly valuable
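
As a toy illustration of budgeting compute from a fitted power law (the constants a and b below are made-up placeholders, not values from the episode or from the scaling-law papers):

```python
# Toy scaling-law calculation: loss(C) = a * C**(-b) for training compute C in FLOPs.
# a and b are arbitrary placeholders chosen only so the numbers are readable.
a, b = 12.0, 0.05

def loss_at(compute_flops: float) -> float:
    return a * compute_flops ** (-b)

def compute_for(target_loss: float) -> float:
    # Invert the power law: C = (a / L)**(1/b)
    return (a / target_loss) ** (1.0 / b)

print(f"{loss_at(1e23):.3f}")      # predicted loss at 1e23 FLOPs (~0.85 under this toy fit)
print(f"{compute_for(0.80):.2e}")  # FLOPs needed to reach a loss of 0.80 (~3.3e23)
```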

Research Philosophy and Lateral Thinking

Ilya Sutskever believes one of his strengths as a researcher is having strong conviction in correct ideas, enabling him to distinguish bugs from flawed fundamental concepts when experiments produce unexpected results.

  • RL environments could be developed to incentivize lateral thinking—the ability to recognize when pursuing a research track makes no sense and return to first principles
  • The ability to abandon a promising-seeming direction is distinct from creativity and represents a valuable metacognitive skill
  • Good research intuition comes from experiencing failure repeatedly and developing pattern recognition for promising versus dead-end approaches
  • Jang's GitHub username is ericjang and his AutoGo repo can be forked to reproduce automated Go training results for further experimentation

Key Technical Formulas and Parameters

The PUCT (Predictor + Upper Confidence Bound for Trees) algorithm balances exploitation of known good actions against exploration of uncertain ones through its confidence-bound mechanism (a small numeric illustration follows the list below).

  • The exploration term √N / (1 + Nₐ) decreases as actions are visited more frequently, naturally decaying exploration over time
  • CPUCT is a hyperparameter controlling the strength of the exploration bonus, typically set around 1.0-2.0 in practice
  • The prior probabilities Pₐ come from the policy network and encode the learned likely distribution of good moves
  • Q(s,a) represents the mean value estimate for taking action a from state s, updated during the backup phase of MCTS
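
To make the decay of the exploration term concrete, here is a small numeric illustration (the prior, the CPUCT value, and the assumption that the parent's visit count N grows in step with Nₐ are all arbitrary choices for the example):

```python
import math

c_puct, prior = 1.5, 0.2  # example exploration constant and policy prior Pa

# Exploration bonus CPUCT * Pa * sqrt(N) / (1 + Na) as action a keeps being visited.
for n_a in [0, 1, 5, 20, 100]:
    N = n_a + 1  # toy assumption: every simulation passes through this parent node
    bonus = c_puct * prior * math.sqrt(N) / (1 + n_a)
    print(f"Na={n_a:4d}  exploration bonus={bonus:.3f}")
```

As Nₐ grows the bonus shrinks roughly like 1/√Nₐ, so a well-visited move is eventually chosen on the strength of its Q(s,a) estimate alone.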