Comment by vidarh

5 months ago

The bulk in terms of the number of tokens may well be synthetic data, but I personally know of at least 3 companies, 2 of which I've done work for, that have people doing substantial amounts of bespoke writing under rather heavy NDAs. I've personally done a substantial amount of bespoke writing for training data for one provider, at good tech contractor fees (though I know I'm one of the highest-paid people for that company, and rates vary by a factor of several even at a company with no exposure to third-world contractors).

That said, the speculation that you just "get various combinations" of those contributions is nonsense, and it's also by no means only STEM data.

How do those companies gauge that what those contractors are writing isn't AI-generated?

  • It doesn't matter if it's AI-generated per se, so it's no crisis if some make it through. What matters is whether it is good. So there are multiple rounds of review to judge the output and pick up reviewers that keep producing poor results.

    But I also know they've fired people who were dumb enough to cut and paste a response that included UI elements from a given AI website...