[@DwarkeshPatel] What does the next training paradigm look like?

Link: https://youtu.be/20p5-kQXF_Q

Duration: 19 min

Short Summary

This episode narrates a blog post from dwarkesh.com arguing that the entire AI industry is making one big research bet: that training agents on millions of verifiable tasks across thousands of RL environments will produce AGI. The author lays out the limits of current RLVR (reinforcement learning with verifiable rewards): real-world skills lack a "grindable" outer-loop verifier, models are roughly one-millionth as sample-efficient as humans, and the ~30–50% of lab compute spent on inference does nothing yet to improve the model. He then proposes alternatives: On-Policy Self-Distillation, "dreaming"/test-time training, and a 2027–2028 continual-learning scenario in which week-long co-work sessions are distilled back into the base model.

Key Quotes

  1. "this kind of training will have created a kind of problem-solving agent: the kind of thing that can make progress on open-ended tasks for weeks on end in the face of errors and mistakes and ambiguity." (00:00:16)
  2. "these models are one one-millionth as sample-efficient as humans" (00:00:50)
  3. "You can't just have a thousand agents go try the same checkout flow on Amazon to get better at using websites, because Andy Jassy will find your bots and shut your ass down." (00:00:43)
  4. "There's two things. There's the context length you train at, and there's a context length that you serve at. If you train at a small context length and then try to serve at a long context length, maybe you get these degradations." (00:01:05)
  5. "For example, the Cursor Tab model online-learns by predicting the same exact objective for over 400 million requests a day." (00:09:52)

Detailed Summary

Background

  • The episode is a narration of a blog post released on the author's website at dwarkesh.com (with footnotes), not a sit-down interview.

The Central Research Bet

  • All labs are betting that training AIs on millions of verifiable tasks across thousands of diverse RL environments will produce an AGI that can make progress on open-ended tasks for weeks despite errors and ambiguity.
  • Current AI models are roughly one-millionth as sample-efficient as humans during training, a gap optimists hope more compute will steamroll, the way scaling LLMs dissolved long-standing NLP problems.

Verifiable & "Grindable" Domains

  • Coding RL training can spin up a thousand parallel agents against identical copies of a container holding a software repo with a missing feature to be implemented.
  • Computer use lags coding and math despite being verifiable, because domains must be both verifiable and "grindable" — runnable across many deterministic parallel rollouts from the same starting point.
  • Skills like building a business, winning court cases, profitable trading, and winning elections require real-world interaction whose outer-loop verification takes months or years and cannot be reproduced inside a datacenter.
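
To make "grindable" concrete, here is a minimal sketch of many parallel rollouts that all start from the same state and are scored by a cheap automatic check. Everything in it is a toy assumption for illustration, not a lab's actual setup: the "repo" is a dict, the "agent" is a seeded coin flip, and the names INITIAL_STATE, run_rollout, and grind are hypothetical.

```python
import copy
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical starting state: a tiny "repo" with one unimplemented feature.
INITIAL_STATE = {"files": {"app.py": "def add(a, b):\n    pass\n"}}

def run_rollout(seed: int) -> float:
    """Run one attempt on a private copy of the starting state; return a 0/1 reward."""
    state = copy.deepcopy(INITIAL_STATE)  # every rollout starts from the same point
    rng = random.Random(seed)             # seeded, so each rollout is reproducible
    # Stand-in for an agent policy: it "implements the feature" about 30% of the time.
    if rng.random() < 0.3:
        state["files"]["app.py"] = "def add(a, b):\n    return a + b\n"
    # Verifier: execute the candidate file and check its behaviour.
    namespace = {}
    exec(state["files"]["app.py"], namespace)
    return 1.0 if namespace["add"](2, 3) == 5 else 0.0

def grind(num_rollouts: int) -> list[float]:
    """Run many independent rollouts in parallel and collect their rewards."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(run_rollout, range(num_rollouts)))

rewards = grind(1000)
print(f"success rate: {sum(rewards) / len(rewards):.2f}")
```

The two properties the sketch relies on are the ones the bullets name: the starting state can be copied at will, and the reward arrives immediately. A business, a court case, or an election offers neither.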

Long Context and Wasted Compute

  • Dario (on a podcast) noted model performance degrades at long context because of the gap between training context length and serving context length, so short-horizon RL training doesn't necessarily generalize to long-horizon performance.
  • Around 30–50% of a lab's compute currently goes to inference, and that compute is not yet playing any productive role in improving the model.

Online Learning: The Cursor Tab Case Study

  • The Cursor Tab model online-learns by predicting which edits actually got accepted by the user, running the same exact objective across over 400 million requests a day.
  • All successfully shipped online-learning models have had to learn the exact same thing across millions of users, because gradient updates are super sample-inefficient and a single session doesn't generate enough data to train a more capable AI.
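
As a rough illustration of what "the same exact objective" across many requests can look like, the sketch below updates a tiny logistic model one request at a time from accept/reject feedback. It is not a description of Cursor's system: the single "edit is short" feature, the acceptance rates, and the class name OnlineAcceptModel are all made up for the example.

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

class OnlineAcceptModel:
    """Tiny logistic model updated one request at a time with plain SGD."""

    def __init__(self, num_features: int, lr: float = 0.1):
        self.w = [0.0] * num_features
        self.lr = lr

    def predict(self, x):
        """Predicted probability that the suggestion is accepted."""
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x, accepted: bool):
        # Gradient of the log loss for a single (features, accepted) pair.
        error = self.predict(x) - (1.0 if accepted else 0.0)
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]

rng = random.Random(0)
model = OnlineAcceptModel(num_features=2)
for _ in range(5000):                       # each iteration stands in for one request
    short_edit = rng.random() < 0.5
    x = [1.0, 1.0 if short_edit else 0.0]   # bias term + hypothetical "edit is short" flag
    accepted = rng.random() < (0.8 if short_edit else 0.2)  # made-up user behaviour
    model.update(x, accepted)

print(model.predict([1.0, 1.0]), model.predict([1.0, 0.0]))
```

Even this two-weight model needs thousands of noisy accept/reject events to settle, which hints at why a single user session yields too little data for gradient updates.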

OPSD vs. RLVR for Continual Learning

  • On-Policy Self-Distillation (OPSD) is presented as superior to RLVR for continual learning for two reasons: (1) it needs no outer-loop verifiable reward, only a model that learns within its context window, and (2) it gives denser supervision, training on the per-token probability discrepancy between teacher and student rather than on a single scalar reward spread across a whole trajectory.
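
The difference in supervision density can be sketched numerically. The example below computes one discrepancy value per token position between a teacher and a student distribution, next to the single number an RLVR-style reward would provide for the same trajectory. The logits are invented, and the choice of KL(teacher || student) as the discrepancy is an assumption made for illustration; the source only says "per-token probability discrepancy".

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def per_token_signal(teacher_logits, student_logits):
    """One discrepancy value per position: teacher vs. student next-token distribution."""
    return [kl(softmax(t), softmax(s)) for t, s in zip(teacher_logits, student_logits)]

# Toy 4-token trajectory over a 3-word vocabulary (made-up logits).
student = [[2.0, 0.5, 0.1], [0.2, 0.2, 0.2], [1.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
teacher = [[2.0, 0.5, 0.1], [0.2, 2.5, 0.2], [1.0, 1.0, 0.0], [3.0, 0.0, 0.0]]

dense_signal = per_token_signal(teacher, student)  # four numbers, one per token
scalar_reward = 0.0                                # RLVR-style: one number for the whole trajectory

print([round(v, 3) for v in dense_signal], scalar_reward)
```

In this toy case the per-token values are zero where teacher and student agree (positions 0 and 2) and positive where they differ (positions 1 and 3), so the student can see which tokens to change. The scalar reward only says the trajectory as a whole failed.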

"Dreaming" / Test-Time Training as a Fourth Axis

  • EfficientZero illustrates the principle by playing dozens of simulated games in its head for each step in the real game, suggesting a sample-efficiency mechanism that could inspire future LLM training.
  • Test-time training, or "dreaming," is proposed as a fourth axis of scaling alongside pretraining, RL, and inference-time compute, where the model spends compute writing RL environments and training against them.
  • A /dream command is floated as a more compute-intensive alternative to /compact (used in Codex, Cursor, Claude), which spends only a small amount of compute to write a summary.
  • The author cautions that simulating the whole world is much harder than simulating Go, making test-time training more speculative than its game analogues.
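
The idea of playing dozens of simulated games for each real step can be sketched with a toy planner. This is not EfficientZero's algorithm: it is a plain random-rollout planner on a number line, where model_step stands in for a learned world model and happens to match the real environment exactly. GOAL, model_step, imagined_return, and plan are hypothetical names.

```python
import random

GOAL = 5  # target position on a number line (toy task)

def model_step(position: int, action: int) -> int:
    """Stand-in for a learned world model: predicts the next position."""
    return position + action

def imagined_return(position, first_action, horizon, rng):
    """Play one simulated game: take first_action, then act randomly; score the end state."""
    pos = model_step(position, first_action)
    for _ in range(horizon - 1):
        pos = model_step(pos, rng.choice([-1, 1]))
    return -abs(GOAL - pos)  # closer to the goal is better

def plan(position, rng, dreams_per_action=24, horizon=3):
    """Pick the action whose imagined rollouts score best on average."""
    def average_score(action):
        total = sum(imagined_return(position, action, horizon, rng)
                    for _ in range(dreams_per_action))
        return total / dreams_per_action
    return max([-1, 1], key=average_score)

rng = random.Random(0)
position, real_steps = 0, 0
while position != GOAL and real_steps < 50:
    # 48 imagined games (24 per candidate action) are spent on each real step.
    position = model_step(position, plan(position, rng))
    real_steps += 1

print(position, real_steps)
```

The planner only works because the model is a perfect copy of a trivial environment, which is the author's caveat: a usable model of Go is within reach, a usable model of the whole world is not.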

A 2027–2028 Continual-Learning Scenario

  • The author sketches one scenario where effective context lengths expand enough for AIs to co-work with a user for a full week of wall-clock time.
  • RLVR-trained agents would be competent enough to get bearings on unfamiliar problems, try different strategies, and iterate past roadblocks — the prerequisite for real-world learning.
  • At the end of a week of co-work, the user gives a thumbs up or thumbs down, and on approval the base model distills the session's learnings using OPSD, dreaming, or similar techniques.
  • Capabilities would then expand into domains adjacent to what was RLVR-trained, and then adjacent to what was learned online, growing well beyond verifiable training domains.
  • Once continual learning arrives, the main way AIs improve shifts from pre-release training to experience accumulated through broad deployment across many task types and users.