[@DwarkeshPatel] Why AI Needs a Trillion Words to Do What Humans Do Easily - Dario Amodei
· 3 min read
Link: https://youtu.be/XLuI5RiMncA
Short Summary
Early language models like GPT-1 generalized poorly because they were trained on narrow datasets (GPT-1 largely on fanfiction) rather than broad internet text. Shifting to comprehensive corpora such as Common Crawl and Reddit link scrapes significantly improved generalization. A key open puzzle is the sample-efficiency gap: current models are pre-trained on trillions of tokens, while human learning operates on a far smaller volume of words.
Key Quotes
- "If you look at the models before GPT-1, they were trained on these data sets that didn't represent a wide distribution of text, right? You had, you know, these very standard kind of language modeling benchmarks. And GPT-1 itself was trained on a bunch of, I think it was fanfiction actually, which is a very small fraction of the text that you get. And it didn't generalize well. You know, we had all these measures of like how well does a model do at predicting all of these other kinds of texts. You really didn't see the generalization." (00:00:00)
- "It was only when you trained over all the tasks on the internet, when you kind of did a general internet scrape, right, from something like Common Crawl, or scraping links on Reddit, which is what we did for GPT-2, it's only when you do that that you kind of started to get generalization." (00:00:27)
- "But I think there is a puzzle here: when we train the model on pre-training, we use like trillions of tokens, right? And humans don't see trillions of words. So there is an actual sample efficiency difference here." (00:00:42)
Detailed Summary
- Evolution of Training Data: Models prior to GPT-1 were constrained by datasets lacking wide text distribution, limiting their generalization capabilities.
- GPT-1 Specifics: GPT-1 leveraged fanfiction data, a small fraction of available text, which restricted its potential compared to broader internet sources.
- Broad Internet Scrapes: Generalization only emerged once models were trained on general internet scrapes, such as Common Crawl or links shared on Reddit, the approach used for GPT-2.
- Sample Efficiency Puzzle: A significant disparity exists between current models, trained on trillions of tokens, and human learning, which operates on a much smaller data volume.
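The scale of the sample-efficiency gap can be made concrete with a back-of-envelope calculation. The token count and human-exposure figures below are illustrative assumptions, not numbers from the talk:

```python
# Rough comparison of model pre-training data vs. human language exposure.
# All figures are assumptions for illustration only.

MODEL_TOKENS = 10 * 10**12   # assume ~10 trillion pre-training tokens
WORDS_PER_DAY = 20_000       # assumed words a person hears/reads daily
YEARS = 20                   # assumed learning window

# Total words a human might encounter over the learning window.
human_words = WORDS_PER_DAY * 365 * YEARS

# Ratio of model training tokens to human word exposure.
gap = MODEL_TOKENS / human_words

print(f"Human exposure: ~{human_words:.1e} words")
print(f"Sample-efficiency gap: ~{gap:,.0f}x")
```

Under these assumptions a human sees on the order of 10^8 words, so the model consumes tens of thousands of times more text, which is the disparity the quote calls a puzzle.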
