Comment by htrp
5 hours ago
We pay people to create more high quality tokens (mercor, turing) which are then fed into data generating processes (synthetic data) to create even more tokens to train on
But does that really help, or do you get distortion? The frequency distribution of human-generated content moves slowly over time as new subjects are discussed. What frequency distribution do those “data generating processes” use? And at root, aren’t those “data generating processes” basically just another LLM (i.e., generating tokens according to a probability distribution)? So aren’t we just feeding AI slop into the next training run and flattering ourselves by renaming the slop “synthetic data”? Not trying to be argumentative. I’m far from being an AI expert, so maybe I’m missing something. Feel free to explain why I’m wrong.
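The worry above has a simple cartoon version. The sketch below is my own illustration, not anything from the thread: fit a toy Gaussian "model" to data, then train each next generation only on samples drawn from the previous model. With finite samples, the fitted spread drifts downward, so the distribution narrows over generations, which is the essence of the "slop feedback" concern (often called model collapse).

```python
import random
import statistics

def fit(samples):
    # "Training": estimate mean and stddev from the data.
    return statistics.fmean(samples), statistics.pstdev(samples)

def generate(mu, sigma, n, rng):
    # "Synthetic data": sample from the fitted model.
    return [rng.gauss(mu, sigma) for _ in range(n)]

def collapse_demo(generations=200, n=20, seed=0):
    # Toy loop: each generation is trained purely on the previous
    # generation's synthetic output, with no fresh human data mixed in.
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # "real" human data
    sigmas = []
    for _ in range(generations):
        mu, sigma = fit(data)
        sigmas.append(sigma)
        data = generate(mu, sigma, n, rng)
    return sigmas

sigmas = collapse_demo()
print(f"gen 0 stddev: {sigmas[0]:.4f}, gen 199 stddev: {sigmas[-1]:.6f}")
```

Run it and the final stddev is far below the initial one: the purely self-trained chain loses diversity. It is only a cartoon of the statistics, not of how labs actually curate synthetic corpora.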
That's the problem in a nutshell. There is an art to how you generate the synthetic data so that you don't end up with badly trained models (especially when mistakes cost XX million dollars).
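Part of that "art" is not training on raw model output at all. A hypothetical sketch of one common-sense piece of it, with illustrative names and a deliberately crude scoring rule (real pipelines use reward models, classifiers, deduplication, and verification): score synthetic candidates, keep only the best, and cap the synthetic share so human data stays the anchor.

```python
def quality_score(text):
    # Stand-in scorer (illustrative only): penalize very short or
    # highly repetitive strings via the unique-word ratio.
    words = text.split()
    if len(words) < 3:
        return 0.0
    return len(set(words)) / len(words)

def build_training_mix(human_data, synthetic_candidates,
                       keep_fraction=0.3, synth_cap=0.5):
    # Keep only the top-scoring synthetic candidates...
    ranked = sorted(synthetic_candidates, key=quality_score, reverse=True)
    kept = ranked[: int(len(ranked) * keep_fraction)]
    # ...and cap synthetic data at synth_cap of the final mix,
    # so trusted human data remains the majority anchor.
    max_synth = int(len(human_data) * synth_cap / (1 - synth_cap))
    return human_data + kept[:max_synth]

human = ["the cat sat on the mat",
         "tokens are sampled from a distribution"]
synth = ["good good good good",
         "a model can paraphrase human text usefully",
         "hi"]
print(build_training_mix(human, synth, keep_fraction=0.67))
```

The design point is the combination: filtering alone still drifts if the synthetic fraction grows without bound, which is why the cap matters as much as the scorer.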
It's also, in theory, why Facebook paid $14bn for Alex Wang and Scale AI.