Comment by gojomo
6 months ago
This was an intuitively-appealing belief, even with some qualified experimental support, as of a few years ago.
However, since then, a bunch of capability breakthroughs from (well-curated) AI generations have definitively disproven it.
AI generates useful stuff, but unless producing it took a lot of complicated prompting, it's still true that you could "just ask the question yourself."
This will change as contexts get longer and people start feeding large stacks of books and papers into their prompts.
> you could "just ask the question yourself."
Just like googling, AIing is a skill. You have to know how to evaluate and judge AI responses. Even how to ask the right questions.
Asking the right questions, especially, is harder than people realize. You see the same difference among human managers: some get good results and others don’t, even when given the same underlying team.
If you don’t know the answers, how can you judge the machine output?
You have to know how to evaluate and judge articles on the internet, too.
Prompt engineering might not be something people do for much longer:
https://spectrum.ieee.org/prompt-engineering-is-dead
No, new more-capable and/or efficient models have been forged using bulk outputs of other models as training data.
These improved models do some valuable things better & cheaper than the models, or ensembles of models, that generated their training data. So you could not "just ask" the upstream models. The benefits emerge from further bulk training on well-selected synthetic data from the upstream models.
Yes, it's counterintuitive! That's why it's worth paying attention to, & describing accurately, rather than remaining stuck repeating obsolete folk misunderstandings.
That's a process that's internal to companies doing training. It has nothing to do with publishing outputs on the internet.
> a bunch of capability breakthroughs from (well-curated) AI generations has definitively disproven it.
How much work is "well-curated" doing in that statement?
Less than you might think! Some of the frontier-advancing training-on-model-outputs ('synthetic data') work just uses other models & automated checkers to select suitable prompts and desirable subsets of generations.
I find it (very) vaguely like how a person can improve at a sport or an instrument without an expert guiding every step, just by drilling certain behaviors in an adequately-proper way. Training on synthetic data somehow extracts a similar iterative improvement in certain directions, without requiring any more natural data. It somehow succeeds in using more compute to refine yet more value from the entropy of the original non-synthetic training data.
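To make that concrete, here's a toy rejection-sampling loop in Python. Every name in it (`teacher_generate`, `checker`, the arithmetic task) is a made-up stand-in, not any lab's actual pipeline; it just illustrates the shape of the process: generate in bulk, keep only what an automated checker accepts, and set the survivors aside as training data for a new model.

```python
import random

def teacher_generate(prompt: str) -> str:
    """Stand-in for sampling from an upstream ('teacher') model.
    Fakes an arithmetic answer that is wrong ~25% of the time."""
    a, b = map(int, prompt.split("+"))
    return str(a + b + random.choice([0, 0, 0, 1]))

def checker(prompt: str, completion: str) -> bool:
    """Automated verifier: for checkable tasks (math, code with unit
    tests), correctness can be scored with no human in the loop."""
    a, b = map(int, prompt.split("+"))
    return completion == str(a + b)

def build_synthetic_dataset(prompts, samples_per_prompt=8):
    """Keep only (prompt, completion) pairs the checker accepts."""
    kept = []
    for p in prompts:
        for _ in range(samples_per_prompt):
            c = teacher_generate(p)
            if checker(p, c):
                kept.append((p, c))
                break  # one verified completion per prompt suffices here
    return kept

if __name__ == "__main__":
    prompts = [f"{random.randint(1, 99)}+{random.randint(1, 99)}"
               for _ in range(20)]
    data = build_synthetic_dataset(prompts)
    print(f"kept {len(data)}/{len(prompts)} verified pairs for the student model")
```

The point of the toy: the teacher is unreliable, but cheap verification lets you distill a training set more reliable than any single teacher sample, which is roughly why "just ask the upstream model" underperforms training on its filtered output.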
"adequately-proper way" is doing an incredible amount of heavy lifting in that sentence.
How will AI write about a world it never experiences? By training on the work of human beings.
The training sets can already include direct data series about the world, where the "work of human beings" is just setting up the collection devices. So models can absolutely "experience the world".
But I'm not suggesting they'll advance much, in the near term, without any human-authored training data.
I'm just pointing out the cold hard fact that lots of recent breakthroughs came via training on synthetic data - text prompted by, generated by, & selected by other AI models.
That practice has now generated a bunch of notable wins in model capabilities – contra the upthread post's sweeping & confident wrongness alleging "Ai generated content is inherently a regression to the mean and harms both training and human utility".
> models can absolutely "experience the world"
How does the banana bread taste at the café around the corner? What's the vibe like there? Is it a good place for people-watching?
What's the typical processing time for a family reunion visa in Berlin? What are the odds your case worker will speak English? Do they still accept English-language documents or do they require a certified translation?
Is the Uzbek-Tajik border crossing still closed? Do foreigners need to go all the way to the northern crossing? Is the Pamir highway doable on a bicycle? How does bribery typically work there? Are people nice?
The world is so much more than the data you have about it.
> data series about the world, where the "work of human beings" is just setting up the collection devices. So models can absolutely "experience the world"
But not experience it the way humans do.
We don’t experience a data series; we experience sensory input in a complicated, nuanced way, modified by prior experiences and emotions, etc. Remember that qualia are subjective, with a biological underpinning.
One example of useful output does not negate the flood of pollution. I’m not denying or downplaying the usefulness of AI. I am doubting the wisdom of blindly publishing -anything- without making at least a trivial attempt to ensure that it is useful and worth publishing. It is a form of pollution.
The problem is that it lowers the effort required to produce SEO spam and to “publish” to nearly zero, which creates a perverse incentive to shit on the sidewalk.
Consider, for example, the flood of AI-created, blatantly false blog posts about drug interactions. Not advertising, just banal filler to drive site visits, with dangerously false information.
It’s not like shitting on the sidewalk was never a problem before, it’s just that shitting on the sidewalk as a service (SOTSAAS) maybe is something we should try to avoid.
I didn’t mean to imply that -no- AI-generated content is useful, only that the vast, vast majority is pollution. The problem is that it is so cheap to produce garbage content with AI that writing actual content is disincentivized, and doing web searches has become an exercise in sifting through AI-generated slop.
That at least adds extra work to filtering usable training data, and it costs users minutes a day wading through the refuse.