Comment by zahlman
23 days ago
> I gave 7 frontier LLMs a simple task: pilot a drone through a 3D voxel world and find 3 creatures.
> Only one could do it.
If I understood the chart correctly, even the successful one only found 1/6 of the creatures across multiple runs.
No science detected.
Without comparison to some null hypothesis (a random policy), this article is hogwash.
Given that all the other agents failed to find any creatures, it's hard to imagine that a random policy would find any except by extreme coincidence.
It is possible to be consistently wrong in a way that randomness is not.
For some problems, randomness outperforms incompetent reasoning.
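For what it's worth, a random-policy baseline is cheap to sketch. The snippet below is only an illustration on a made-up toy world, not the article's benchmark: the grid size, step budget, start cell, and the names `random_policy_run` and `baseline` are all assumptions of mine. It measures what fraction of creatures a uniformly random walk stumbles onto, which is the kind of number the LLM results would need to be compared against.

```python
import random

# The six axis-aligned moves available to the hypothetical drone.
MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def random_policy_run(size, creatures, start, max_steps, rng):
    """Walk uniformly at random; return how many distinct creatures were visited."""
    pos = start
    found = set()
    if pos in creatures:
        found.add(pos)
    for _ in range(max_steps):
        step = rng.choice(MOVES)
        # Clamp to the grid so the drone cannot leave the toy world.
        pos = tuple(min(size - 1, max(0, p + d)) for p, d in zip(pos, step))
        if pos in creatures:
            found.add(pos)
    return len(found)


def baseline(n_runs, seed=0, size=8, n_creatures=3, max_steps=200):
    """Fraction of creatures found by the random policy, averaged over n_runs worlds."""
    rng = random.Random(seed)
    cells = [(x, y, z) for x in range(size) for y in range(size) for z in range(size)]
    total = 0
    for _ in range(n_runs):
        # Each run gets a fresh, randomly populated world (an assumption of this toy setup).
        creatures = set(rng.sample(cells, n_creatures))
        total += random_policy_run(size, creatures, (0, 0, 0), max_steps, rng)
    return total / (n_runs * n_creatures)


if __name__ == "__main__":
    print(f"random policy found {baseline(1000):.1%} of creatures")
```

Whatever number this prints says nothing about the real benchmark; the point is only that the comparison is a few dozen lines of work, so leaving it out is hard to excuse.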