Comment by onlyrealcuzzo

2 days ago

I'm interested in how much of the "Clean Data" is synthetic data from "unclean" models...

> with AI-generated content excluded from pre-training.

> without distillation from third-party models

Sounds like zero, unless they are lying.

  • > with AI-generated content excluded from pre-training.

    Though this is largely impossible these days, unless they pre-trained on pre-AI era data.

    • That could be. Just use pre-training for language understanding and let the post-training on synthetic data do the heavy lifting.

  • "how many of those shapes are rectangles?" "sounds like zero unless they are squares"

    Adding "unless" to a statement makes it vacuous if the latter clause is weaker than the first clause. I find it hard to believe that a company willing to violate licenses would have scruples about lying about it.

    • Not vacuous, but tautological. Which is different, because tautologies can actually be quite directly informative. Whereas vacuous truths tend to be oblique.

      Also, “Microsoft is lying” is not a logically stronger statement, because they might be lying about something other than whether they distilled or trained on AI output.

    • > Adding "unless" to a statement makes it vacuous if the latter clause is weaker than the first clause

      I think that's the point. "How do I say they're lying without outright saying they're lying?"

      It's a common rhetorical trick.
