Comment by SamPatt

6 months ago

Nonsense. Have you used any of the deep research tools?

Don't fall for the utopia fallacy. Humans also publish junk.

Yes, and deep research was junk for the hard topics that I actually needed to sit down and research. Anything shallower I can usually reach with a search engine and a quick scan; deep research saves me about 15-30 minutes for well-covered topics.

For the hard topics, the solution is still the same as pre-AI: search for popular survey papers, then start crawling through the citation network and keeping notes. The LLM output had no sense of what was actually impactful versus what was a junk paper in the niche topic I was interested in, so I had no alternative but quality time with Google Scholar.

We are a long way from deep research even approaching a well-written survey paper written by grad student sweat and tears.

  • > deep research saves me about 15-30 minutes for well-covered topics.

    Most people are capable of maybe 4 good hours a day of deep knowledge work. Saving 30 minutes is a lot.

  • Not everything is hard topics though.

    I've found getting a personalized report for the basic stuff is incredibly useful. Maybe you're a world-class researcher if it only saves you 15-30 minutes; I'm positive it has saved me many hours.

    Grad students aren't an inexhaustible resource. Getting a report that's 80% as good in a few minutes for a few dollars is worth it for me.

Steel-man angle: a desire for data provenance is a good thing, with benefits that are independent of the utopia-fallacy and humans-versus-machines questions.

But all provenance systems get gamed. I predict the most reliable methods will be cumbersome and not widespread, and will thus cover little actual content. The easily-gamed systems will be the ones in widespread use, embedded in social media apps, etc.

Questions:

1. Does there exist a data provenance system that is both easy to use and reliable "enough" (for some sufficient definition of "enough")? Can we do bcrypt-style more-rounds = more-security and trade time for security? (See the sketch below.)

2. Is there enough of an incentive for the major tech companies to push adoption of such a system? How could this play out?
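
On the bcrypt analogy in question 1: a minimal Python sketch of the knob being referenced, using the third-party bcrypt package. The payload name is hypothetical, and this only illustrates the tunable-work idea, not an actual provenance scheme.

```python
# Sketch of bcrypt's "trade time for security" knob: each +1 to `rounds`
# roughly doubles the hashing work, for attacker and defender alike.
# Requires: pip install bcrypt
import time

import bcrypt

secret = b"content-fingerprint-or-signature-input"  # hypothetical payload

for cost in (10, 12, 14):
    start = time.perf_counter()
    hashed = bcrypt.hashpw(secret, bcrypt.gensalt(rounds=cost))
    elapsed = time.perf_counter() - start
    print(f"rounds={cost}: {elapsed:.3f}s, prefix={hashed[:29].decode()}")

# Verification pays the same per-hash cost, which is the whole point:
# raising `rounds` buys brute-force resistance at the price of time.
assert bcrypt.checkpw(secret, hashed)
```

Whether a provenance system has an equivalent cost dial is exactly the open part of question 1.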

Yes, but GP's idea of segregating AI-generated content is worth considering.

If you're training an AI, do you want it to get trained on other AIs' output? That might actually be interesting, but I think you'd then want to have both: an AI trained on everything, and another trained on everything except other AIs' output. So perhaps an HTML tag indicating "this is AI-generated" might be a good idea.
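
To make that concrete (purely hypothetical; no such tag is standardized today), here is a small Python sketch of what a self-declared marker, plus the naive check a training-data crawler might run, could look like. The `generator-type` meta name is invented for illustration.

```python
# Hypothetical marker for AI-generated pages plus a naive stdlib detector.
# Nothing here is a standard; the tag name is made up.
from html.parser import HTMLParser

AI_META = '<meta name="generator-type" content="ai-generated">'  # hypothetical


def wrap_ai_output(body_html: str) -> str:
    """Wrap AI-generated HTML with the hypothetical marker in <head>."""
    return f"<html><head>{AI_META}</head><body>{body_html}</body></html>"


class AIMarkerDetector(HTMLParser):
    """What a crawler filtering a 'no-AI' training set might check for."""

    def __init__(self) -> None:
        super().__init__()
        self.is_ai_generated = False

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if (tag == "meta" and d.get("name") == "generator-type"
                and d.get("content") == "ai-generated"):
            self.is_ai_generated = True

    def handle_startendtag(self, tag, attrs):
        # Treat self-closing <meta ... /> the same way.
        self.handle_starttag(tag, attrs)


page = wrap_ai_output("<p>Text written by a model.</p>")
detector = AIMarkerDetector()
detector.feed(page)
print(detector.is_ai_generated)  # True -> exclude from the "no-AI" corpus
```

Of course a self-declared tag is trivially omitted or stripped, which is the discipline problem the replies below point out; it only helps the honest-publisher case.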

  • My 2c is that it's worthwhile to train on AI-generated content that has obtained some level of human approval or interest, as a form of extended RLHF loop.

  • I can see the value of labeling all AI-generated content so that models can be trained on purely non-AI-generated content.

    But I don’t think that’s a reasonable goal. Pragmatic example: there are almost no optional HTML tags or optional HTTP headers that are used anywhere close to 100% of the time they apply.

    Also, I think the field is already muddy, even before the game starts. Spell checkers, grammar.ly, and machine translation all have AI contributions and likely affect most human-generated text on the internet. The heuristic of “one drop of AI” is not useful, and any heuristic more complicated than “one drop” introduces too much subjective complexity for a Boolean data type.

    • Yes, it's impossible. We'd have to have started years ago. And then people wouldn't have the discipline to label content correctly or at all. It can't be done.

  • Shouldn't there be enough training content from the pre-AI era that the system itself can determine whether content is AI-generated, or whether it matters?

    • Just ask anyone who works in teaching, or try any of the numerous AI detectors (they're all faulty).

      Any current technology that can be used to accurately detect pre-AI content would necessarily imply that the same technology could be used to train an AI to generate content that skirts by the AI detector. Sure, there is going to be a lag time, but eventually we will run out of non-AI content.

    • No, that's the problem. Pre-AI era content a) is often not dated, so not identifiable as such, and b) also gets out of date. What was thought to be true 20 years ago might not be thought to be true today. Search for the "half-life of facts".

The observation that humans poop is not sufficient justification for spending millions of dollars building an automated firehose that pumps a torrent of shit onto the public square.

  • People are paying millions for access to the models. They are getting value from them or wouldn't be paying.

    It's just not accurate to say they only produce shit. Their rapid adoption demonstrates otherwise.

    • I make no claim to the overall value of LLMs. I'm just pointing out that your analogy is a fallacy. The fact that group A does a small bad thing is not a justification for allowing group B to do a large bad thing. That is true regardless of what other non-bad things group B does.

      It may be the case that the non-bad things B does outweigh the bad things. That would be an argument in favor of B. But another group doing bad things has no bearing on the justification for B itself.

    • In my experience, the people spending "millions" are hoping to get those millions back tenfold, because a buddy of theirs told them "this AI thing" is going to replace the most expensive part of companies (the staff costs), not because they think the product is any good. We're getting AI forced down our throats because VCs are throwing cash in like there's no tomorrow, not because of whatever value might or might not be there.