
Comment by hodgehog11

2 months ago

Under what metrics are you judging these improvements? If you're talking about improving benchmark scores, as others have pointed out, those are increasing at a regular rate (putting aside the occasional questionable training practice where the benchmark ends up in the training set). But most individuals seem to be judging "order of magnitude jumps" by whether the model can solve a very specific set of their use cases to a given level of satisfaction. This is a highly nonlinear metric, so changes will always appear incremental until suddenly they aren't. Judging progress this way is alchemy, and it leads only to hype cycles.

Every indication I've seen is that LLMs are continuing to improve, each fundamental limitation recognized is eventually overcome, and there are no meaningful signs of slowing down. Unlike with prior statistical models, which had fundamental limitations with no known solutions, I have not seen evidence that any particular programming task achievable by humans cannot eventually be solved by LLM variants. I'm not saying that they necessarily will be, of course, but I'd feel a lot more comfortable seeing evidence that they won't.

I think it actually makes sense to trust your vibes more than benchmarks. The act of creating a benchmark is the hard part. If we had a perfect benchmark, AI problems would be trivially solvable. Benchmarks are meaningless on their own; they are supposed to be a proxy for actual usefulness.

I'm not sure what better metric there is than "can it do what I want?" And for me, the ratio of yes to no on that hasn't changed much.

  • I agree that this is a sensible judgement for practical use, but my point is that the vibes will likely change; it's just a matter of when. You can't draw a trendline on a nonlinear metric, especially when you have no knowledge of the inflection point. Individual benchmarks are certainly fallible, and we always need better ones, but the aggregate of all the benchmarks together (and other theoretical metrics not based on test data) correlates reasonably well with opinion polling, and these are all improving at a consistent rate. It's just unclear when these model improvements will lead to the outcomes you're looking for. When it happens, it will appear to be a massive leap in performance, but really it's just a threshold being hit.