Comment by tivert
5 months ago
> Yip. It's pretty obvious this 'innovation' is just based off training data collected from chain-of-thought prompting by people, ie., the 'big leap forward' is just another dataset of people repairing chatgpt's lack of reasoning capabilities.
Which would be ChatGPT chat logs, correct?
It would be interesting if people started feeding ChatGPT deliberately bad repairs for its "lack of reasoning capabilities" (e.g. get a local LLM set up with some response delays to simulate a human, and just let it talk and talk and talk to ChatGPT), and see how that affects its behavior over the long run.
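A rough sketch of what that could look like, assuming any OpenAI-compatible local server (the local endpoint, model names, and prompts below are all made up for illustration):

    import random
    import time

    from openai import OpenAI

    # The real ChatGPT API; reads OPENAI_API_KEY from the environment.
    remote = OpenAI()
    # A local OpenAI-compatible server (e.g. Ollama/llama.cpp) playing the "human".
    local = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    history = [{"role": "user", "content": "Can you explain why my sorting code is wrong?"}]

    for _ in range(100):  # capped for the sketch; the idea is to let it run indefinitely
        reply = remote.chat.completions.create(
            model="gpt-4o-mini", messages=history
        ).choices[0].message.content
        history.append({"role": "assistant", "content": reply})

        # Crude "a human is typing" delay so the traffic doesn't look automated.
        time.sleep(random.uniform(30, 180))

        # The local model plays the confused user, "repairing" the answer badly.
        repair = local.chat.completions.create(
            model="llama3",
            messages=[{
                "role": "user",
                "content": "Reply like a user correcting this answer, but introduce "
                           "subtle mistakes in the correction:\n\n" + reply,
            }],
        ).choices[0].message.content
        history.append({"role": "user", "content": repair})

The response delay is doing most of the work in that sketch: without it, the conversation pattern would look nothing like a real user session.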
These logs get manually reviewed by humans, sometimes annotated by automated systems first. The setups for manual review typically involve half a dozen steps, with different people reviewing, comparing reviews, revising comparisons, and overseeing the revisions (source: I've done contract work at every stage of that process, and have half a dozen internal documents for a company providing this service open right now). A lot of money is being pumped into automating parts of this, but a lot of money still also flows into manually reviewing and quality-assuring the whole process. Any logs showing significant quality declines would get picked up and filtered out pretty quickly.
So you're saying that if we can run these other LLMs for ChatGPT to talk to more cheaply than they can review the logs, then we either have a monetary denial-of-service attack against them, or a money-printing machine if we can get ourselves hired into the review process. (Apparently I can't link to my favorite "I will write myself a minivan" comic because someone got cancelled, but I trust the reference will work here without a link or a political back-and-forth erupting.)
> apparently I can't link to my favorite "I will write myself a minivan" comic
It looks like it's been mirrored in several places, e.g.:
https://english.stackexchange.com/questions/488178/what-does...
No.
Because the output of that review process is better training data.
You'd need to produce data that is more expensive to review and improve than the random crap from users who are often entirely clueless, and/or data that degrades the output of the training process enough to make using real prompts as part of it problematic.
Trying to compete with real users on producing junk input would prove a real challenge in itself - you have no idea what kind of utter, incomprehensible drivel real users send to LLMs.
But part of this process also already includes writing a significant number of prompts from scratch, testing them, and then improving the responses, to create training data.
From what I've seen, I doubt there is much of a cost saving in using real user prompts there - the benefit of real user prompts is a more representative sample, but if that sample starts producing shit you'll just not use it, use less of it, or only use prompts from subsets of users you have reason to believe are more representative of real use.
Put another way: you can hire people to write prompts to replace that side of it far more cheaply than you can hire people who can properly review the output of the more complex prompts, and the time taken to review the responses is far higher than the time to address issues with the prompts. One provider often tells people to spend up to ~1 hour reviewing responses to simple coding tasks, for example, even though the prompt might just be "implement BTree."
I suspect they can detect that in a similar way to captchas and "verify you're human by clicking the box".
> I suspect they can detect that in a similar way to captchas and "verify you're human by clicking the box".
I'm not so sure. IIRC, captchas are pretty much a solved problem if you don't mind the cost of a little bit of human interaction (e.g. your interface pops up a captcha solver box when necessary, and it is solved either by the bot's operator or by some professional captcha-solver in a low-wage country).
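The hand-off pattern being described is simple enough to sketch; everything below is hypothetical (solve_captcha and the input() prompt stand in for whatever the real automation and solver interface would be):

    import queue
    import threading

    # Challenges the bot can't handle go here; a human (the operator, or an
    # outsourced solver) picks them up and sends the answer back.
    pending = queue.Queue()

    def human_solver():
        # Stands in for the "captcha solver box" popping up in front of a person.
        while True:
            image_path, reply = pending.get()
            reply["answer"] = input(f"Solve the captcha shown in {image_path}: ")
            reply["done"].set()

    threading.Thread(target=human_solver, daemon=True).start()

    def solve_captcha(image_path):
        # Called by the bot whenever it hits a captcha; blocks until a human answers.
        reply = {"done": threading.Event()}
        pending.put((image_path, reply))
        reply["done"].wait()
        return reply["answer"]

    # The bot's main loop would call something like:
    # token = solve_captcha("/tmp/challenge.png")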
There are services that solve captchas automatically, with above-human success rates.