Comment by verse

1 day ago

I agree with you, but I'd point out that unless you've read the book, it's difficult to know whether the answer you got was accurate or whether it just kinda made it up. In my experience, it makes stuff up.

Like, it behaves as if any answer is better than no answer.

So do humans when they're asked to answer test questions. The appropriate thing is to compare to human performance on the same task.

At most of these comprehension tasks, AI is already superhuman (in part because Gary picked scaled tasks that humans are surprisingly bad at).

  • You can't really compare to human performance because the failure modes and performance characteristics are so different.

    In some instances you'll get results that are shockingly good (and in no time); in others you'll have a grueling experience going in circles over fundamental reasoning, where you'd probably fire any person on the spot for that kind of discussion chain.

    And there's no learning between sessions or subject-area mastery: results on the same topic can vary within the same session (even with the relevant context included).

    So if something is superhuman part of the time and subhuman part of the time, but there's no good way of telling which you'll get or when, the result isn't the average if you're trying to use the tool.