Comment by toss1
1 day ago
Indeed, they are not forced to train them on user outputs, but the author of the article seems to have found good evidence that they are actually doing that, and that they will need more expert data-tagging/filtering on the inputs to regain their previous performance.
I don't think the author of the article found "good evidence". He found a specific case where there was a regression. This could be due to:
- models actually getting worse in general
- his specific style of prompting working well with older models and less well with newer models
- the behavior his test measures no longer being a priority for the big AI labs
From the article:
> GPT-4 gave a useful answer every one of the 10 times that I ran it. In three cases, it ignored my instructions to return only code, and explained that the column was likely missing from my dataset, and that I would have to address it there.
Here, ignoring the instructions in order to give a "useful answer" (as evaluated by the author) is counted as a good thing. That means a model trained to follow instructions more faithfully would lose points on this test.
To me this article feels a bit like saying "this new gun that shoots straight 100% of the time is worse than the older gun that shot straight only 50% of the time, because sometimes I shoot at something I don't actually want to shoot at!". And in a way, it is true: if you're used to being able to shoot at things without them getting hurt, the new gun will be worse from that point of view. But to spin up a whole theory about garbage in/garbage out from that? Or to conclude all models are getting worse rather than that you're maybe no longer the target audience? That seems weird to me.
You're right - I wasn't considering how narrow his case is and was perhaps overgeneralizing, particularly about the cause.
Seems we agree the better solution when column_index_+1 doesn't exist is to call it out instead of stealthily appending a new column, but why the newer models have that behavior is indeed speculative.
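To make that distinction concrete, here is a minimal Python/pandas sketch of the "call it out" behavior; the function name, the example DataFrame, and the exact error message are my own illustration, not taken from the article:

```python
import pandas as pd


def get_column_after(df: pd.DataFrame, column_index: int) -> pd.Series:
    """Return the column at column_index + 1, failing loudly if it doesn't exist."""
    target = column_index + 1
    if target >= len(df.columns):
        # "Call it out": surface the missing column to the user instead of
        # silently creating one (the "stealthy append" behavior discussed above
        # would be something like `df[f"col_{target}"] = pd.NA` here).
        raise IndexError(
            f"Column index {target} does not exist; the DataFrame has only "
            f"{len(df.columns)} columns. The column may be missing from the dataset."
        )
    return df.iloc[:, target]


if __name__ == "__main__":
    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    print(get_column_after(df, 0))   # returns column "b"
    try:
        get_column_after(df, 1)
    except IndexError as err:
        print(err)                   # column index 2 does not exist ...
```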
It echoes a bit the conundrum from back in the PC days, when IBM hardware was the de facto standard and companies building "compatible" hardware had to decide whether to be compatible with the spec or with every detail of the implementation, including buggy behavior, of which some software of course took advantage. So, do they build to be "compatible" or "bug-compatible"?
Was the GPT-4 response highlighting the missing column a bug or a failure to shoot straight? Not sure I'd characterize it that way, but there definitely could be many other reasons for the change in behavior (other than training on lower-skilled programmers' inputs); we really have to treat that as conjecture on the author's part.