Comment by ulrikrasmussen
7 hours ago
I guess the experiment is interesting as a way to determine whether a model can produce something subjectively valued as "good" from fairly vague and open-ended specifications. The benchmark is not whether the output fits the input, but whether the output is internally consistent: it's a game, but does it behave the way one would expect any game to behave? Does it end when you reach the goal, do you die when you hit the spikes, are there weird edge cases in behavior when you move around?
I think, however, that they should have used the same harness and also repeated the experiment a few times to judge the variance in the results.