Comment by ben_w

4 days ago

> and they still can't refrain from feeding me false info

If that's your metric, and even then only if you've got a boolean yes/no measurement, then I agree.

If you measure "false info" as a percentage, they're better. If you measure scores on IQ tests, on general knowledge, on exams, on the size of a coding problem they can complete before hitting a 20% chance of failure, on the quality of the translations they make, on new modalities like being able to both consume and respond with images, on mathematical olympiad questions, then they're significantly better.

Unfortunately, we can tell by the general public reaction (not just you) that even all those things combined still don't fully capture what normal people mean by "intelligence".

> They are useless to me because I will take more time checking their output than I will just doing the task myself.

What size problem do you give them? I use them for software, and try to keep each single task I give them to something that would take a human about 90 minutes. I can check the quality of an attempt at a human-would-take-90-minutes-to-do task in about 5 minutes.

When I've accidentally let an LLM do bigger tasks than that, then the difficulty of checking goes way up and the quality of the output goes way down.

Conveniently, one of the tasks that generally takes a human less than 90 minutes is breaking a bigger task down into sub-tasks that themselves each take less than 90 minutes. Fail to do this and I get exactly what you experience.