[@DwarkeshPatel] What does the next training paradigm look like?

June 26, 2026 · 5 min read

Video Bot

Duration: 19 min

Short Summary

This episode is a narration of a blog post from dwarkesh.com arguing that the entire AI industry is making one big research bet: that training agents on millions of verifiable tasks across thousands of RL environments will produce AGI. The author lays out the limits of current reinforcement learning with verifiable rewards (RLVR): many real-world skills lack "grindable" outer-loop verification, models are about one-millionth as sample-efficient as humans, and the roughly 30–50% of lab compute spent on inference does nothing to improve the model. The author then proposes alternatives such as On-Policy Self-Distillation, "dreaming" (test-time training), and a 2027–2028 continual-learning scenario in which week-long co-work sessions are distilled back into the base model.

Key Quotes

"this kind of training will have created a kind of problem-solving agent: the kind of thing that can make progress on open-ended tasks for weeks on end in the face of errors and mistakes and ambiguity." (00:00:16)
"these models are one one-millionth as sample-efficient as humans" (00:00:50)
"You can't just have a thousand agents go try the same checkout flow on Amazon to get better at using websites, because Andy Jassy will find your bots and shut your ass down." (00:00:43)
"There's two things. There's the context length you train at, and there's a context length that you serve at. If you train at a small context length and then try to serve at a long context length, maybe you get these degradations." (00:01:05)
"For example, the Cursor Tab model online-learns by predicting the same exact objective for over 400 million requests a day." (00:09:52)

Detailed Summary

Background

The episode is a narration of a blog post released on the author's website at dwarkesh.com (with footnotes), not a sit-down interview.

The Central Research Bet

All labs are betting that training AIs on millions of verifiable tasks across thousands of diverse RL environments will produce an AGI that can make progress on open-ended tasks for weeks despite errors and ambiguity.
Current AI models are roughly one-millionth as sample-efficient as humans during training, a gap that optimists hope more compute will steamroll, much as long-standing NLP problems collapsed once LLMs were scaled up.

Verifiable & "Grindable" Domains

Coding RL training can spin up a thousand parallel agents against identical copies of a container holding a software repo with a missing feature to be implemented.
Computer use lags coding and math despite being verifiable, because domains must be both verifiable and "grindable" — runnable across many deterministic parallel rollouts from the same starting point.
Skills like building a business, winning court cases, profitable trading, and winning elections require real-world interaction whose outer-loop verification takes months or years and cannot be reproduced inside a datacenter.
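
The "grindable" property described above can be made concrete with a toy sketch: many rollouts start from identical private copies of one starting state, and a deterministic verifier scores each. Everything here is an illustrative assumption, not a lab's actual setup: `INITIAL_STATE` stands in for a container snapshot, and `rollout`, `verify`, and `grind` are hypothetical names. The "agent" is just a seeded coin flip.

```python
import copy
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a container snapshot: a repo with one missing feature.
INITIAL_STATE = {
    "files": {"math_utils.py": "def add(a, b):\n    return a + b\n"},
    "feature_done": False,
}

def verify(state):
    """Deterministic verifier: 1.0 if the feature was implemented, else 0.0."""
    return 1.0 if state["feature_done"] else 0.0

def rollout(seed, success_rate=0.3):
    """One agent attempt on a private copy of the same starting state."""
    rng = random.Random(seed)
    state = copy.deepcopy(INITIAL_STATE)  # identical start, no shared side effects
    # Toy 'agent': succeeds with a fixed probability (an assumption for illustration).
    state["feature_done"] = rng.random() < success_rate
    return verify(state)

def grind(num_rollouts=1000):
    """Run many rollouts in parallel from the same starting point."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(rollout, range(num_rollouts)))
```

Calling `grind(1000)` returns one reward per rollout, and repeating the call gives the same list because each rollout is seeded and isolated. A live checkout flow or a court case offers neither property: the starting state cannot be cloned, and the verdict does not arrive within the training loop.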

Long Context and Wasted Compute

Dario Amodei noted on a podcast that performance may degrade at long context when a model is trained at a short context length but served at a long one, so short-horizon RL training doesn't necessarily generalize to long-horizon performance.
Around 30–50% of a lab's compute currently goes to inference, and that compute is not yet playing any productive role in improving the model.

Online Learning: The Cursor Tab Case Study

The Cursor Tab model online-learns by predicting which edits actually got accepted by the user, running the same exact objective across over 400 million requests a day.
All successfully shipped online-learning models have had to learn the exact same thing across millions of users, because gradient updates are super sample-inefficient and a single session doesn't generate enough data to train a more capable AI.
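
As a rough illustration of learning one fixed objective from a stream of accept/reject signals, here is a minimal online logistic-regression sketch. It is not Cursor's system: the single `quality` feature, the simulated users, and the names `OnlineAcceptModel` and `simulate` are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class OnlineAcceptModel:
    """Predicts P(user accepts a suggestion) and updates after every request."""

    def __init__(self, num_features, lr=0.1):
        self.w = [0.0] * num_features
        self.lr = lr

    def predict(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x, accepted):
        # One SGD step on log loss; (p - y) is the gradient w.r.t. the logit.
        err = self.predict(x) - (1.0 if accepted else 0.0)
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]

def simulate(num_requests=5000, seed=0):
    """Stream of simulated requests; better suggestions are accepted more often."""
    rng = random.Random(seed)
    model = OnlineAcceptModel(num_features=2)
    for _ in range(num_requests):
        quality = rng.random()            # hypothetical feature in [0, 1)
        x = [1.0, quality]                # bias term plus the feature
        accepted = rng.random() < quality
        model.update(x, accepted)
    return model
```

Each request contributes only one noisy bit of feedback, so the model needs thousands of updates before its predictions separate good suggestions from bad ones. That is the point made above: the objective must be shared across a very large number of requests for gradient updates to add up.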

OPSD vs. RLVR for Continual Learning

On-Policy Self-Distillation (OPSD) is presented as better suited than RLVR to continual learning for two reasons. First, it needs no outer-loop verifiable reward, only a model that can learn within its context window. Second, it gives denser supervision: it trains on the per-token discrepancy between teacher and student probabilities rather than on a single scalar reward spread across a whole trajectory.
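
The difference in supervision density can be sketched with made-up numbers. The source does not say which discrepancy measure is used; this example assumes reverse KL, KL(student || teacher), computed per token. The distributions, the function names, and the reward value are all hypothetical.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) over one token's vocabulary distribution.

    Assumes the teacher assigns nonzero probability wherever the student does.
    """
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def opsd_token_losses(student_dists, teacher_dists):
    """Dense signal: one discrepancy value per generated token."""
    return [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]

def rlvr_token_credit(num_tokens, reward):
    """Sparse signal: one scalar for the trajectory, shared by every token."""
    return [reward] * num_tokens

# Toy 4-token trajectory over a 3-word vocabulary (made-up numbers).
student = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]

dense = opsd_token_losses(student, teacher)
sparse = rlvr_token_credit(len(student), reward=0.0)
```

In this toy case `dense` is zero everywhere except the third token, where teacher and student disagree, so the update knows which token to fix. `sparse` carries the same value at every position and says only that the trajectory as a whole failed.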

"Dreaming" / Test-Time Training as a Fourth Axis

EfficientZero, a model-based RL agent, plays dozens of simulated games in its head for each step of the real game, a sample-efficiency mechanism that could inspire future LLM training.
Test-time training, or "dreaming," is proposed as a fourth axis of scaling alongside pretraining, RL, and inference-time compute, where the model spends compute writing RL environments and training against them.
A /dream command is floated as a more compute-intensive alternative to /compact (used in Codex, Cursor, Claude), which spends only a small amount of compute writing a summary.
The author cautions that simulating the whole world is much harder than emulating Go, making test-time training more speculative than its game analogues.
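
The "simulated games per real step" idea can be sketched in a toy corridor world. This is not EfficientZero's algorithm: that system learns its world model and searches with MCTS, whereas this sketch swaps in a hand-written model and plain random rollouts. `GOAL`, `model_step`, `plan`, and `run_episode` are hypothetical names.

```python
import random

GOAL = 5  # right edge of a 1-D corridor; positions are clamped to [0, GOAL]

def model_step(pos, action):
    """Hand-written world model (an assumption; EfficientZero learns its model)."""
    return max(0, min(GOAL, pos + action))

def imagined_return(pos, first_action, depth, rng):
    """One simulated game: take first_action, then random moves in the model."""
    pos = model_step(pos, first_action)
    for _ in range(depth - 1):
        if pos == GOAL:
            break
        pos = model_step(pos, rng.choice((-1, 1)))
    return -abs(GOAL - pos)  # ending closer to the goal scores higher

def plan(pos, rng, rollouts_per_action=30, depth=3):
    """Pick the action whose imagined games end closest to the goal."""
    scores, games = {}, 0
    for action in (-1, 1):
        returns = [imagined_return(pos, action, depth, rng)
                   for _ in range(rollouts_per_action)]
        scores[action] = sum(returns) / len(returns)
        games += rollouts_per_action
    return max(scores, key=scores.get), games

def run_episode(seed=0, max_real_steps=50):
    """Act in the 'real' corridor, dreaming dozens of games before each step."""
    rng = random.Random(seed)
    pos, real_steps, imagined_games = 0, 0, 0
    while pos != GOAL and real_steps < max_real_steps:
        action, games = plan(pos, rng)
        imagined_games += games
        pos = model_step(pos, action)  # here the real world matches the model
        real_steps += 1
    return pos, real_steps, imagined_games
```

Each real step costs 60 imagined games here, so almost all of the compute is spent in simulation rather than in the world. The sketch works only because the corridor can be modeled exactly, which is the caveat raised above: no comparably cheap and accurate simulator exists for the whole world.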

A 2027–2028 Continual-Learning Scenario

The author sketches one scenario where effective context lengths expand enough for AIs to co-work with a user for a full week of wall-clock time.
RLVR-trained agents would be competent enough to get bearings on unfamiliar problems, try different strategies, and iterate past roadblocks — the prerequisite for real-world learning.
At the end of a week of co-work, the user gives a thumbs up or thumbs down, and on approval the base model distills the session's learnings using OPSD, dreaming, or similar techniques.
Capabilities would then expand into domains adjacent to what was RLVR-trained, and then adjacent to what was learned online, growing well beyond verifiable training domains.
Once continual learning arrives, the main way AIs improve shifts from pre-release training to experience accumulated through broad deployment across many task types and users.