Comment by fc417fc802
8 hours ago
Not sure how to answer because we were off on a tangent there about mental models.
I think AGI is two things: intelligence at a given task, which can be scored against humans or otherwise, and generalization, which is entirely separate. We already have superhuman non-general models in a few domains.
So I don't think that "better than AGI at % of humans" is a sensible statement, at least not initially.
Right now humans generalize to all integers while AI companies keep manually adding additional integers to a finite list and bystanders make claims of generality. If you've still got a finite list, you aren't general, regardless of how long the list is.
If at some point a model shows up that works on all even integers but not odd ones then I guess you could reasonably claim you had AGI that was 50% of what humans achieve. If a model that generalizes to all the reals shows up then it will have exceeded human generality by an infinite degree. We'll cross those bridges when we come to them - I don't think we're there yet.
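To make the analogy concrete, here's a toy sketch (the names and cases are purely illustrative, not from any real model or benchmark): a hard-coded lookup only ever covers the finitely many cases someone bothered to add, while a general rule covers the whole domain by construction.

```python
# Toy contrast: a "finite list" of handled cases vs. a general rule.
# Everything here is a made-up illustration of the argument above.

HANDLED_CASES = {0: True, 1: True, 2: True}  # manually extended, always finite


def finite_list_model(n: int) -> bool:
    """Only works for inputs someone explicitly added to the list."""
    if n not in HANDLED_CASES:
        raise NotImplementedError(f"case {n} was never added")
    return HANDLED_CASES[n]


def general_model(n: int) -> bool:
    """One rule covering every integer -- there is no list to extend."""
    return True  # trivially handles all of Z, including inputs never seen
```

No matter how many entries get added to `HANDLED_CASES`, it stays finite; `general_model` covers the whole domain at once, which is the distinction being drawn.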
Interestingly, I find that the models generalize decently well as long as the "training" (which is more analogous to human training) fits in a small enough context. That is to say, "in-context learning" seems good enough for real use.
But of course, that's not quite "long term".
Given that models don't currently learn as they go, isn't that exactly what this benchmark is testing? If the model needs either to have been explicitly trained in a similar environment or to have a human manually input a carefully crafted prompt, then it isn't general. The latter case is a human tuning a powerful tool.
If it can add the necessary bits to its own prompt while working on the benchmark, then it's generalizing.