Comment by zahlman

24 days ago

> I gave 7 frontier LLMs a simple task: pilot a drone through a 3D voxel world and find 3 creatures.

> Only one could do it.

If I understood the chart correctly, even the successful one only found 1/6 of the creatures across multiple runs.

No science detected.

Without comparison to some null hypothesis (a random policy), this article is hogwash.

  • Given that all the other agents failed to find any creatures, it's hard to imagine that a random policy would find one, except by extreme coincidence.

    • It is possible to be consistently wrong in a way that randomness is not.

      For some problems, randomness outperforms incompetent reasoning.
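
A toy sketch of the point about being consistently wrong, not a reproduction of the article's benchmark. Everything here is a made-up assumption: the 5x5x5 world, the start and creature positions, the step budget, and the `stubborn_policy` that stands in for "incompetent reasoning" by always flying in the same direction. It only shows why a random baseline is worth measuring: the stubborn policy can never reach the target, while a random walk probably will, given enough steps.

```python
import random

SIZE = 5                      # toy world: SIZE x SIZE x SIZE voxels (assumed)
START = (0, 0, 0)
TARGET = (2, 3, 1)            # hypothetical creature location
MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def step(pos, move):
    """Apply a move; stay in place if it would leave the world."""
    nxt = tuple(p + d for p, d in zip(pos, move))
    return nxt if all(0 <= c < SIZE for c in nxt) else pos


def random_policy(pos, rng):
    # Null-hypothesis baseline: pick one of the six moves uniformly.
    return rng.choice(MOVES)


def stubborn_policy(pos, rng):
    # A "consistently wrong" policy: always fly along +x, never reconsider.
    return MOVES[0]


def finds_target(policy, rng, budget=2000):
    """Run one episode; True if the agent reaches TARGET within the budget."""
    pos = START
    for _ in range(budget):
        if pos == TARGET:
            return True
        pos = step(pos, policy(pos, rng))
    return pos == TARGET


def success_rate(policy, trials=200, seed=0):
    """Fraction of seeded episodes in which the policy finds the target."""
    rng = random.Random(seed)
    return sum(finds_target(policy, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("random:  ", success_rate(random_policy))
    print("stubborn:", success_rate(stubborn_policy))
```

The stubborn policy scores exactly zero because it pins itself against the +x wall. Whether real LLM agents fail in a comparably systematic way is exactly what a random-policy control would help show.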