Comment by K0balt

6 months ago

AI-generated content is inherently a regression to the mean and harms both training and human utility. There is no benefit in publishing anything that an AI can generate; just ask the question yourself. Maybe publish all AI content with <AI generated content> tags, but other than that it is a public nuisance much more often than a public good.

Following this logic, why write anything at all? Shakespeare's sonnets are arrangements of existing words that were possible before he wrote them. Every mathematical proof, novel, piece of journalism is simply a configuration of symbols that existed in the space of all possible configurations. The fact that something could be generated doesn't negate its value when it is generated for a specific purpose, context, and audience.

  • Following that logic, we should publish all unique random orderings of words. I think there is a book about a library like that; the book itself, unlike the library, is a great read and not a regression to the mean of ideas.

    Writing worth reading as a non-child surprises, challenges, teaches, and inspires. LLM writing tends towards the least surprising, worn-out tropes that challenge only the patience and attention of the reader. The eager learner, however, will tolerate that, so I suppose I’ll give them teaching. They are great at children’s stories, where the goal is to rehearse and introduce tropes and moral lessons with archetypes, effectively teaching the listener the language of story.

    FWIW I am not particularly a critic of AI and am engaged in AI-related projects. I am quite sure that the transformer-architecture breakthrough will lead to the third industrial revolution, for better or for worse.

    But there are some things we shouldn’t be using LLMs for.

This was an intuitively-appealing belief, even with some qualified experimental support, as of a few years ago.

However, since then, a bunch of capability breakthroughs from (well-curated) AI generations have definitively disproven it.

  • AI generates useful stuff, but unless it took a lot of complicated prompting, it's still true that you could "just ask the question yourself."

    This will change as contexts get longer and people start feeding large stacks of books and papers into their prompts.

    • > you could "just ask the question yourself."

      Just like googling, AIing is a skill. You have to know how to evaluate and judge AI responses. Even how to ask the right questions.

      Especially asking the right questions is harder than people realize. You see this difference in human managers where some are able to get good results and others aren’t, even when given the same underlying team.

      4 replies →

    • No, new more-capable and/or efficient models have been forged using bulk outputs of other models as training data.

      These improved models do some valuable things better & cheaper than the models, or ensembles of models, that generated their training data. So you could not "just ask" the upstream models. The benefits emerge from further bulk training on well-selected synthetic data from the upstream models.

      Yes, it's counterintuitive! That's why it's worth paying attention to, & describing accurately, rather than remaining stuck repeating obsolete folk misunderstandings.

      1 reply →

  • > a bunch of capability breakthroughs from (well-curated) AI generations have definitively disproven it.

    How much work is "well-curated" doing in that statement?

    • Less than you might think! Some of the frontier-advancing training-on-model-outputs ('synthetic data') work just uses other models & automated checkers to select suitable prompts and desirable subsets of generations.

      I find it (very) vaguely like how a person can improve at a sport or an instrument without an expert guiding them through every step, just by drilling certain behaviors in a sufficiently disciplined way. Training on synthetic data somehow extracts a similar iterative improvement in certain directions, without requiring any more natural data. It somehow succeeds in using more compute to refine yet more value from the entropy of the original non-synthetic training data.
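      Concretely, that select-then-train loop looks something like the sketch below (illustrative only, not taken from any particular paper: `upstream_model` and `checker` are hypothetical stand-ins for a real generator and automated verifier, and the "task" is trivial arithmetic so that correctness is cheap to check):

      ```python
      # Sketch of filtered synthetic-data generation ("rejection sampling"):
      # sample many candidate generations, keep only those an automated checker
      # accepts, and use the survivors as training examples for the next model.
      import random


      def upstream_model(prompt: str, n: int) -> list[str]:
          """Hypothetical generator: returns n candidate answers, some of them wrong."""
          a, b = map(int, prompt.split("+"))
          return [str(a + b + random.choice([0, 0, 0, 1, -1])) for _ in range(n)]


      def checker(prompt: str, answer: str) -> bool:
          """Automated verifier for a task whose correctness is cheap to check."""
          a, b = map(int, prompt.split("+"))
          return answer == str(a + b)


      def build_synthetic_dataset(prompts: list[str], samples_per_prompt: int = 8):
          dataset = []
          for prompt in prompts:
              for candidate in upstream_model(prompt, samples_per_prompt):
                  if checker(prompt, candidate):      # keep only verified outputs
                      dataset.append({"prompt": prompt, "completion": candidate})
                      break                           # one good example per prompt
          return dataset


      if __name__ == "__main__":
          prompts = [f"{random.randint(1, 99)}+{random.randint(1, 99)}" for _ in range(100)]
          data = build_synthetic_dataset(prompts)
          print(f"kept {len(data)} verified examples from {len(prompts)} prompts")
      ```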

      2 replies →

  • How will AI write about a world it never experiences? By training on the work of human beings.

    • The training sets can already include direct data series about the world, where the "work of human beings" is just setting up the collection devices. So models can absolutely "experience the world".

      But I'm not suggesting they'll advance much, in the near term, without any human-authored training data.

      I'm just pointing out the cold hard fact that lots of recent breakthroughs came via training on synthetic data - text prompted by, generated by, & selected by other AI models.

      That practice has now generated a bunch of notable wins in model capabilities – contra the upthread post's sweeping & confident wrongness alleging "AI-generated content is inherently a regression to the mean and harms both training and human utility".

      9 replies →

  • One example of useful output does not negate the flood of pollution. I’m not denying or downplaying the usefulness of AI. I am doubting the wisdom of blindly publishing -anything- without making at least a trivial attempt to ensure that it is useful and worth publishing. It is a form of pollution.

    The problem is that it lowers the effort required to produce SEO spam and to “publish” to nearly zero, which creates a perverse incentive to shit on the sidewalk.

    Take, for example, the number of AI-created, blatantly false blog posts about drug interactions. Not advertising, just banal filler to generate site visits, with dangerously false information.

    It’s not like shitting on the sidewalk was never a problem before, it’s just that shitting on the sidewalk as a service (SOTSAAS) maybe is something we should try to avoid.

  • I didn’t mean to imply that -no- AI-generated content is useful, only that the vast, vast majority is pollution. The problem is that it is so cheap to produce garbage content with AI that writing actual content is disincentivized, and doing web searches has become an exercise in sifting through AI-generated slop.

    That, at the least, adds extra work to filter usable training data, and costs users minutes a day wading through the refuse.

What about AI modified or copy edited content?

I write blog posts now by dictating into voice notes, transcribing them, and giving the transcript to ChatGPT or Claude to work on the tone and rhythm.
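Roughly, that pipeline looks like the sketch below (a minimal illustration assuming the OpenAI Python SDK; the model names, file name, and system prompt are placeholders, not my actual setup):

```python
# Dictate -> transcribe -> copy edit. Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the voice note.
with open("voice_note.m4a", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# 2. Have the model adjust tone and rhythm without adding new claims.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Copy edit the following draft for tone and rhythm. "
                       "Keep my ideas and structure; do not add new claims.",
        },
        {"role": "user", "content": transcript.text},
    ],
)

print(response.choices[0].message.content)
```

The system prompt is doing the real work in that sketch: it constrains the model to tone and rhythm edits rather than new content.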

  • So IMHO the right thing is to add an "AI rewritten" label to your blog.

    Hm, I wonder where this kind of label should live? For a personal blog, putting it on every post seems redundant, since if the author uses AI, they likely use it for all posts. And many blogs don't have a dedicated "about this blog" section.

    I wonder if things will end up like organic food labeling or "made in .." labels. Some blogs might say "100% by human", some might say "Designed by human, made by AI" and some might just say nothing.

If I ask the question myself then there's no step where a human expert has vetted the content and put their name on it. That curation and vouching is of value.

Now your mind might have immediately gone "pffff, as if they're doing that", and I agree, but only to the extent that it largely wasn't happening prior to AI anyway. The vast majority of internet content was already low quality, rushed out by low-paid writers who lacked expertise in what they were writing about. AI doesn't change that.

  • Completely agree. We are used to thinking of authorship as the critical step. We're going to have to adjust to thinking of publication as the critical step. In an ideal world, publication of a piece would be seen as vouching for that piece. Putting your reputation on the line.

    I wonder if we'll see a resurgence in reputation systems (probably not).

Nonsense. Have you used any of the deep research tools?

Don't fall for the utopia fallacy. Humans also publish junk.

  • Yes, and deep research was junk for the hard topics that I actually needed to sit down and research. Anything shallower I can usually reach with a search engine and a quick scan; deep research saves me about 15-30 minutes for well-covered topics.

    For the hard topics, the solution is still the same as pre-AI: search for popular survey papers, then start crawling through the citation network and keeping notes. The LLM output had no sense of what was actually impactful versus what was a junk paper in the niche topic I was interested in, so I had no alternative but quality time with Google Scholar.

    We are a long way from deep research even approaching a well-written survey paper written by grad student sweat and tears.

    • > deep research saves me about 15-30 minutes for well-covered topics.

      Most people are capable of maybe 4 good hours a day of deep knowledge work. Saving 30 minutes is an eighth of that, which is a lot.

    • Not everything is hard topics though.

      I've found getting a personalized report for the basic stuff is incredibly useful. Maybe you're a world-class researcher if it only saves you 15-30 minutes; I'm positive it has saved me many hours.

      Grad students aren't an inexhaustible resource. Getting a report that's 80% as good in a few minutes for a few dollars is worth it for me.

  • Steel-man angle: A desire for data provenance is a good thing with benefits that are independent of utopias/humans vs machines kinds of questions.

    But all provenance systems get gamed. I predict the most reliable methods will be cumbersome and not widespread, thus covering little actual content. The easily gamed systems will be in widespread use, embedded in social media apps, etc.

    Questions: 1. Does there exist a data provenance system that is both easy to use and reliable "enough" (for some sufficient definition of "enough")? Can we do bcrypt-style more-bits=more-security and trade time for security?

    2. Is there enough of an incentive for the major tech companies to push adoption of such a system? How could this play out?
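    To make question 1 concrete: the easy-to-use end of the spectrum is essentially a signature over the published bytes. A minimal sketch, assuming Python's cryptography package (illustrative only, with a throwaway key):

    ```python
    # Publisher signs the exact bytes of a post; any reader can verify that the
    # holder of this key vouched for them. This proves *who* vouched, not *how*
    # the text was produced, which is why the gaming concern above still applies.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    publisher_key = Ed25519PrivateKey.generate()   # held privately by the publisher
    public_key = publisher_key.public_key()        # published, e.g. in DNS or a registry

    post = "A paragraph the publisher is willing to put their name on.".encode()
    signature = publisher_key.sign(post)           # attached alongside the post

    try:
        public_key.verify(signature, post)         # any reader can check this
        print("valid: the key's owner vouched for this exact text")
    except InvalidSignature:
        print("invalid: tampered with, or never vouched for")
    ```

    That covers the "easy to use" half of question 1; the "reliable enough" half is mostly about how keys get issued, trusted, and revoked.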

  • Yes, but GP's idea of segregating AI-generated content is worth considering.

    If you're training an AI, do you want it to get trained on other AIs' output? That might be interesting actually, but I think you might then want to have both: an AI trained on everything, and another trained on everything except other AIs' output. So perhaps an HTML tag for indicating "this is AI-generated" might be a good idea.

    • My 2c is that it is worthwhile to train on AI generated content that has obtained some level of human approval or interest, as a form of extended RLHF loop.

      3 replies →

    • I can see the value of labeling, so that an AI can be trained on purely non-AI-generated content.

      But I don’t think that’s a reasonable goal. Pragmatic example: there are almost no optional HTML tags or optional HTTP headers that are used anywhere close to 100% of the times they apply.

      Also, I think the field is already muddy, even before the game starts. Spell checkers, Grammarly, and machine translation all have AI contributions and likely affect most of the human-generated text on the internet. The heuristic of "one drop of AI" is not useful. And any heuristic more complicated than "one drop" introduces too much subjective complexity for a Boolean data type.

      1 reply →

  • The observation that humans poop is not sufficient justification for spending millions of dollars building an automated firehose that pumps a torrent of shit onto the public square.

    • People are paying millions for access to the models. They are getting value from them or wouldn't be paying.

      It's just not accurate to say they only produce shit. Their rapid adoption demonstrates otherwise.

      3 replies →