[@DwarkeshPatel] Some thoughts on the Sutton interview
Link: https://youtu.be/u3HBJVjpXuw
Short Summary
This video presents Richard Sutton's "Bitter Lesson" as the claim that the winning AI techniques are those that most effectively and scalably leverage compute, and argues that current LLMs fall short of this because they rely on human data and lack continual learning. While rejecting the strict dichotomy Sutton draws, the speaker believes his critiques point to fundamental gaps in current LLMs, and that future AGI systems will likely incorporate his principles of efficient, continual learning.
Key Quotes
Here are 5 quotes from the YouTube transcript that represent valuable insights:
- "The bitter lesson says that you want to come up with techniques which most effectively and scalably leverage compute."
- "Just because fossil fuels are not a renewable resource does not mean that our civilization ended up on a dead-end track by using them. In fact they were absolutely crucial. You simply couldn't have transitioned from the water wheels of 1800 to solar panels and fusion power plants. We had to use this cheap, convenient and plentiful intermediary to get to the next step."
- "What planes are to birds, supervised learning might end up being to human cultural learning."
- "But if we're not allowed to call their representations a “world model,” then we're defining the term “world model” by the process we think is necessary to build one, rather than by the obvious capabilities the concept implies."
- "Evolution does meta-RL to make an RL agent. That agent can selectively do imitation learning. With LLMs, we’re going the opposite way. We first made a base model that does pure imitation learning. And we're hoping that we do enough RL on it to make a coherent agent with goals and self-awareness. Maybe this won't work!"
Detailed Summary
Here's a detailed summary of the YouTube video transcript, broken down into bullet points:
I. Introduction:
- The video is the interviewer's reflection on Richard Sutton's perspective after revisiting their conversation.
- Acknowledges and apologizes for any misunderstandings of Sutton's views during the initial conversation.
II. Sutton's "Bitter Lesson" Argument (Steelman):
- Core Idea: The "Bitter Lesson" argues for techniques that most effectively and scalably leverage compute.
- Inefficient Compute Use: Current LLMs are inefficient because:
- Most compute is used during deployment (inference), where the model isn't learning.
- Training is sample inefficient, requiring vast amounts of data (equivalent to thousands of years of human experience).
- Learning from Human Data:
- LLMs primarily learn from human data (pretraining and human-furnished RL environments).
- Because human data is finite and hard to scale, techniques that depend on it cannot scale indefinitely.
- Lack of True World Model:
- LLMs don't build a true world model (how actions change the environment).
- They model "what a human would say next," leading to reliance on human-derived concepts. Example: An LLM trained only up to 1900 might not be able to deduce relativity.
- Inability to Learn On-the-Job:
- LLMs can't learn continuously "on-the-job."
- A new architecture is needed for continual learning, eliminating the need for a separate, sample-inefficient training phase.
III. Disagreements with Sutton's Position:
- Main point of divergence: The interviewer believes the dichotomies Sutton draws between LLMs and "true intelligence" (imitation versus experience) are not as clean-cut or mutually exclusive as presented.
- Imitation Learning and RL are Complementary: The speaker believes imitation learning and RL are continuous and synergistic.
- Models of Humans as Prior Knowledge: Models of humans can provide a useful prior knowledge base to aid in learning "true" world models.
- Continual Learning via Test-Time Fine-Tuning: Suggests future versions of test-time fine-tuning might replicate continual learning, drawing parallels to the emergence of in-context learning.
IV. Arguments Supporting Complementary Relationship between Imitation Learning and RL:
- LLMs as a Good Prior for RL: Poses the question of whether pre-trained LLMs can serve as a good prior for accumulating experiential learning (RL) to reach AGI.
- Fossil Fuels Analogy: Uses Ilya Sutskever's analogy of pretraining data as "fossil fuels," arguing that even non-renewable resources can be crucial stepping stones to future advancements. The transition from water wheels to solar panels required "fossil fuels" as an intermediate step.
- AlphaGo/AlphaZero Example: AlphaGo (trained on human games) was superhuman, even if AlphaZero (bootstrapped from scratch) was better. Human data isn't necessarily detrimental; at large scales, it might just be less helpful.
- Humanity's Knowledge Accumulation: Draws a parallel between LLM pretraining and the accumulation of knowledge over human history (language, legal systems, technology). Imitation plays a role in cultural learning even if it's not exactly token prediction.
- Imitation Learning as Short-Horizon RL: Frames imitation learning as RL with very short episodes: predicting the next token is an agent making a conjecture and receiving a reward for accuracy (see the sketch after this list).
- Using Imitation Learning to Improve Learning from Ground Truth: Argues that pretrained models can be RLed against ground-truth tasks such as IMO problems and code that must actually run; this RL only works because the model starts from a reasonable prior learned from human data.
- "World Model" Semantics: Doesn't believe the term "world model" should be defined by how we think it should be built, but rather by the actual capabilities the model possesses.
V. Arguments on Continual Learning:
- Sample Inefficiency of Current RL: Points out the low information gain per episode in current LLM RL, contrasting it with human and animal learning.
- Animals' Ability to Model the World: Highlights how animals model the world through observations, with an outer-loop RL incentivizing that modeling. Mentions Sutton's OaK architecture and its transition model.
- Shoehorning Continual Learning Atop LLMs: Speculates about potential ways to incorporate continual learning, such as making SFT a tool call for the model, where outer-loop RL incentivizes the model to teach itself (a toy sketch follows this list).
- In-Context Learning as a Precursor: Notes that in-context learning shows a resemblance to continual learning and suggests that extending context windows could unlock even greater flexibility.
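The "SFT as a tool call" idea is only speculation in the video, but a toy sketch may make the control flow clearer. Everything below (Model, sft_tool, the scalar "skill") is a hypothetical placeholder, not a real training API; the point is just that the outer-loop return improves when the model's self-teaching calls pay off on later tasks.

```python
# Hypothetical sketch of "SFT as a tool call" for continual learning.
# All names and the scalar `skill` are made-up stand-ins, not a real API.

class Model:
    """Toy stand-in for an LLM that can also emit 'teach myself' notes."""
    def __init__(self):
        self.skill = 0.0                               # stand-in for weights

    def act(self, task):
        success = self.skill >= task["difficulty"]     # succeed if skilled enough
        notes = [] if success else [task["lesson"]]    # jot a note when it fails
        return success, notes

def sft_tool(model: Model, notes: list) -> Model:
    """Tool call: fine-tune the model on notes it wrote for itself."""
    model.skill += 0.1 * len(notes)                    # stand-in for gradient steps
    return model

def outer_loop_return(model: Model, tasks: list) -> float:
    """Outer-loop RL signal: long-horizon return. If self-teaching via
    sft_tool makes later tasks go better, this return rises, so optimizing
    it incentivizes the model to learn how (and when) to teach itself."""
    total = 0.0
    for task in tasks:
        success, notes = model.act(task)
        total += 1.0 if success else 0.0               # ground-truth reward per task
        if notes:                                      # the model chose to update itself
            model = sft_tool(model, notes)
    return total

tasks = [{"difficulty": 0.05 * i, "lesson": f"lesson {i}"} for i in range(20)]
print(outer_loop_return(Model(), tasks))               # higher when self-teaching helps
```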
VI. Conclusion:
- Evolution vs. Current AI Development: Contrasts evolution (meta-RL producing an RL agent that can selectively imitate) with the current LLM approach (pure imitation learning as the base, hoping that RL on top yields a coherent agent with goals and self-awareness).
- Skepticism of First-Principles Arguments: Doesn't find arguments based on abstract principles (like LLMs lacking a "true world model") to be entirely convincing or accurate.
- Value of Sutton's Critique: Acknowledges that Sutton's critique highlights genuine gaps in current models (lack of continual learning, sample inefficiency, reliance on human data).
- Future Direction: Predicts that even if LLMs achieve AGI first, successor systems will likely be based on Sutton's vision, addressing those fundamental limitations.
