Comment by athrowaway3z
7 hours ago
This benchmark inspired me to have Codex/Claude build a DnD battlemap tool with SVGs.
They got surprisingly far, but I did need to iterate a few times to get them to build tools that would check for things like: don't put walls on roads or water.
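To give an idea of the kind of check I mean, here is a minimal sketch. It assumes a simplified tile grid rather than the actual SVG geometry, and every name in it (`find_wall_conflicts`, `FORBIDDEN_UNDER_WALLS`, the terrain labels) is made up for illustration, not what the agents actually wrote:

```python
# Hypothetical tile-based model: terrain maps (x, y) cells to a terrain label,
# and walls is a set of (x, y) cells that contain a wall.
# The real tool worked on SVGs; this grid is an assumption for illustration.
FORBIDDEN_UNDER_WALLS = {"road", "water"}


def find_wall_conflicts(terrain, walls):
    """Return the wall cells that sit on road or water, in sorted order."""
    return sorted(
        cell for cell in walls
        if terrain.get(cell) in FORBIDDEN_UNDER_WALLS
    )


terrain = {(0, 0): "grass", (1, 0): "road", (2, 0): "water", (3, 0): "grass"}
walls = {(0, 0), (1, 0), (2, 0)}

conflicts = find_wall_conflicts(terrain, walls)
print(conflicts)  # [(1, 0), (2, 0)]
```

A check like this gives the agent something concrete to run after each edit, instead of relying on it to eyeball the map.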
What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.
As a next benchmark, you could try telling one agent to use a coding agent (via tmux) to build you a pelican.