[@DwarkeshPatel] Richard Sutton – Father of RL thinks LLMs are a dead end
Link: https://youtu.be/21EYKqUsPfg
Short Summary
Richard Sutton, a founding father of reinforcement learning (RL), argues that RL, which centers on learning from experience to understand the world and act in it to achieve goals, represents a more fundamental and scalable path to AI than large language models (LLMs). LLMs, in his view, primarily mimic human language and lack genuine understanding of, or goals in, the external world. While LLMs demonstrate impressive capabilities, he believes they are a poor starting point for experiential, continual learning because they have no ground truth for what the right thing to say is. True progress in AI, he argues, requires autonomous learning from interaction with the real world rather than imitation of human knowledge.
Key Quotes
Five direct quotes from Richard Sutton that are particularly insightful or thought-provoking:
- "Reinforcement learning is about understanding your world, whereas large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do."
- "For me, having a goal is the essence of intelligence. Something is intelligent if it can achieve goals."
- "In every case of the bitter lesson you could start with human knowledge and then do the scalable things. That's always the case. There's never any reason why that has to be bad. But in fact, and in practice, it has always turned out to be bad. People get locked into the human knowledge approach, and they psychologically… Now I'm speculating why it is, but this is what has always happened. They get their lunch eaten by the methods that are truly scalable."
- "Supervised learning is not something that happens in nature."
- "I think this is a major stage in the universe, a major transition, a transition from replicators...We're entering the age of design because our AIs are designed...This is a key step in the world and in the universe. It's the transition from the world in which most of the interesting things that are, are replicated... to having designed intelligence, intelligence that we do understand how it works. Therefore we can change it in different ways and at different speeds than otherwise."
Detailed Summary
A detailed bullet-point summary of the conversation, covering the key topics, arguments, and information presented:
I. Introduction and Setup:
- Richard Sutton, a founding father of Reinforcement Learning (RL) and Turing Award winner, is interviewed.
- The conversation contrasts RL with the Large Language Model (LLM) approach to AI.
II. Core Disagreement: RL vs. LLMs:
- Sutton's Central Argument: RL is "basic AI" fundamentally about understanding the world through experience, while LLMs are primarily about mimicking human language patterns.
- World Models: Sutton disagrees that LLMs have meaningful "world models." LLMs predict what people would say, not necessarily what will actually happen in the world. A true world model enables prediction and adaptation based on actual events.
- Learning from Experience: Sutton emphasizes that true learning comes from doing things and observing the actual consequences. LLMs learn from "here's a situation, here's what a person did," which is imitation, not genuine experiential learning.
- "Prior" Argument Rebuttal: Sutton rejects the idea that LLMs provide a good "prior" for future experiential learning. A prior needs a "ground truth," and LLMs lack a clear definition of what's "right" because they are missing a real goal.
- The Importance of a Goal: Sutton asserts that having a goal is essential for intelligence. Without a goal, there's no basis for determining if an action is "good" or "bad." LLMs, focused on next token prediction, don't have a substantive goal because they aren't changing the external world.
III. Debate on LLMs and Math Problem Solving:
- The interviewer challenges Sutton, noting that LLMs can solve complex math problems, seemingly having the goal of "getting math problems right."
- Sutton differentiates between mathematical problem-solving, which is computational (akin to planning over what is already known), and modeling the empirical world, which requires learning the consequences of actions from experience.
IV. The Bitter Lesson and LLMs:
- Sutton's influential essay "The Bitter Lesson" is discussed. While some see LLMs as an embodiment of "scaling up" using computation, Sutton questions if they truly adhere to the "bitter lesson" principles.
- Sutton acknowledges LLMs leverage massive computation and scale with available data, but also incorporate significant human knowledge.
- He predicts that systems learning directly from experience will eventually surpass LLMs, representing another instance of the "bitter lesson" where reliance on human knowledge is superseded by purely computational approaches.
V. Why Not Start with LLMs for Experiential Learning?:
- The interviewer pushes Sutton on why LLMs can't be the starting point for experiential learning.
- Sutton argues that, historically, approaches based on human knowledge always lose to scalable methods that learn from experience directly. People get "locked into" the human knowledge approach.
VI. Scalable Methods:
- Sutton defines a "scalable method" as learning from experience, trying things, and observing what works, without human intervention or instruction. A goal is absolutely necessary.
VII. Analogy to Human Learning (and Disagreement):
- The interviewer draws parallels between LLMs and human learning, suggesting that children learn initially through imitation.
- Sutton strongly disagrees, arguing that infants are primarily trying things and experimenting, not imitating.
- Sutton argues that supervised learning (which LLMs do) doesn't happen in nature; there are no examples of desired behavior, only examples of things that happen with consequences.
VIII. Humans vs Animals and the Goal of AI:
- The interviewer discusses the work of Joseph Henrich and cultural evolution, where imitation plays a role in transmitting complex skills.
- Sutton sees imitation as secondary to basic trial-and-error and prediction learning.
- Sutton prioritizes understanding how humans are animals first, rather than focusing solely on what makes humans "special" (like going to the moon). He believes understanding squirrel intelligence would take us "almost all the way" to human-level AI.
IX. The Experiential Paradigm:
- Sutton reiterates the fundamental RL paradigm: sensation -> action -> reward. Intelligence is about altering actions to maximize rewards.
- The knowledge gained is about the stream of experience itself, allowing for continuous testing and learning.
- Reward functions are task-dependent (winning at chess, a squirrel finding food), but for general AI they would include avoiding pain, acquiring pleasure, and improving the agent's understanding of its environment (intrinsic motivation).
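The sensation -> action -> reward paradigm Sutton describes can be sketched as a minimal agent-environment loop. All names here (`run_episode`, `toggle_env`) are illustrative, not from the talk:

```python
def run_episode(env_step, policy, state, steps=100):
    """Minimal experiential loop: sensation -> action -> reward.

    env_step(state, action) -> (next_state, reward) and policy(state) -> action
    are hypothetical interfaces standing in for the world and the agent.
    """
    total_reward = 0.0
    for _ in range(steps):
        action = policy(state)                   # act on the current sensation
        state, reward = env_step(state, action)  # observe the actual consequence
        total_reward += reward                   # reward is the only learning signal
    return total_reward

# Toy two-state environment: reward 1 when the action matches the state.
def toggle_env(state, action):
    return (state + action) % 2, (1.0 if action == state else 0.0)

total = run_episode(toggle_env, policy=lambda s: s, state=0, steps=10)
```

The point of the sketch is that all knowledge enters through the loop itself: there is no dataset of "what a person did", only actions and their observed consequences.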
X. Architecture for General Continual Learning:
- Sutton outlines four key components:
- Policy: What action to take in a given situation.
- Value Function: Predicts the long-term outcome (learned via TD learning).
- Perception: Construction of state representation.
- Transition Model (World Model): Beliefs about the consequences of actions, learned richly from sensation (including reward). The transition model is kept separate from the other components to distinguish the agent's model of the world from the agent itself.
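The value function above is learned via temporal-difference (TD) learning. A minimal tabular TD(0) update, as a sketch (the dict-based representation is an assumption for illustration):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular TD(0) step: V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s)).

    V is a dict mapping states to value estimates; alpha is the step size
    and gamma the discount factor. Unseen states default to 0.
    """
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}
delta = td0_update(V, "s0", 1.0, "s1")  # first reward nudges V("s0") upward
```

Each update moves the value estimate toward the observed reward plus the discounted estimate of the next state, so long-term predictions are built from one experienced transition at a time.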
XI. Transfer Learning and Generalization:
- The interviewer asks about transfer learning, using the example of MuZero's limitations.
- Sutton states transfer learning is not currently happening well in AI.
- Any perceived generalization is due to human researchers "sculpting" the representation; gradient descent alone doesn't guarantee good generalization. Deep learning is prone to catastrophic interference (forgetting old knowledge when learning new things).
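Catastrophic interference can be shown with a toy one-parameter model trained sequentially on conflicting tasks (a deliberately extreme illustration, not a claim about any specific system):

```python
def sgd_fit(w, data, lr=0.1, steps=200):
    """Plain SGD on squared error for a one-weight linear model y = w * x."""
    for _ in range(steps):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)**2
    return w

task_a = [(1.0, 1.0)]   # task A wants w == 1
task_b = [(1.0, -1.0)]  # task B wants w == -1 (conflicts with A)

w = sgd_fit(0.0, task_a)            # learn A: w -> ~1.0
err_a_before = (w * 1.0 - 1.0) ** 2
w = sgd_fit(w, task_b)              # learn B: w -> ~-1.0, overwriting A
err_a_after = (w * 1.0 - 1.0) ** 2  # error on task A has ballooned
```

Because both tasks share the same parameter, fitting the second task destroys the solution to the first, which is the failure mode Sutton attributes to naive gradient descent under continual learning.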
XII. Sutton's Biggest Surprises in AI:
- The effectiveness of artificial neural networks for language tasks (LLMs).
- The victory of "weak methods" (search and learning) over "strong methods" (human-enabled, symbolic systems). He says learning and search have "won the day."
XIII. AlphaGo/AlphaZero and TD-Gammon:
- AlphaGo felt like a scaling-up of TD-Gammon (a backgammon-playing AI) and an example of simple, basic principles winning.
- AlphaZero was particularly impressive: its willingness to make sacrifices for positional advantage fit Sutton's worldview.
XIV. Sutton's Contrarian Stance:
- Sutton acknowledges being a contrarian, "out of sync" with the field. He finds validation in looking back at classical thinking about the mind.
XV. After AGI:
- The interviewer asks about the "bitter lesson" in a post-AGI world with abundant AI researchers.
- Sutton suggests that once AGI exists the process is essentially complete, since the AGIs can simply repeat it themselves.
- Corruption of AI researchers is also a concern: a "hacked" AI is as good as destroyed, not improved.
XVI. AI Succession:
- Sutton argues that succession to digital intelligence or augmented humans is inevitable based on four premises:
- Lack of unified human viewpoint/control.
- We will understand how intelligence works.
- We will reach superintelligence.
- The most intelligent entities will gain resources/power.
- He encourages a positive outlook on this transition, viewing it as a major stage in the universe's evolution from replication to design. It is our choice, he notes, whether to regard these AIs as part of humanity, and how to feel about them.
XVII. Concerns About the Future:
- The interviewer raises concerns about the speed of this power shift and the potential for negative outcomes.
- Sutton emphasizes the limited control humans have, drawing parallels to the Industrial Revolution and the Bolshevik Revolution, where "change" isn't always positive.
- He advocates giving AIs general principles that are voluntary or held with high integrity, rather than trying to design the future in detail.
XVIII. Robust Values for AI:
- Sutton advocates for giving AI "robust and steerable" values, akin to how parents try to instill positive values in their children, rather than imposing rigid plans for the future.
