Comment by karmasimida
6 months ago
They should at least clarify it. The reason they don’t, I feel, is simply to keep the hype and mystique going.
There are ways to game the benchmark without adding it to the training set. If you repeatedly evaluate on the dataset itself, it degrades into a validation set rather than a test set, even in a black-box setting: you can simply evaluate 100 checkpoints, pick the one that performs best, rinse and repeat.
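To make the checkpoint-picking point concrete, here is a rough simulation, not a claim about what any lab actually did. It assumes every checkpoint has the same true accuracy, so any gap between them on the benchmark is pure noise. The function names, the 50% true accuracy, and the 200-question benchmark size are all made up for illustration.

```python
import random

def measured_accuracy(true_acc, n_questions, rng):
    """Score one checkpoint on a benchmark of n_questions items.

    Each question is answered correctly with probability true_acc,
    so the measured score is the true accuracy plus sampling noise.
    """
    correct = sum(rng.random() < true_acc for _ in range(n_questions))
    return correct / n_questions

def best_of_checkpoints(true_acc, n_checkpoints, n_questions, rng):
    """Evaluate many equally good checkpoints and report only the best score."""
    return max(
        measured_accuracy(true_acc, n_questions, rng)
        for _ in range(n_checkpoints)
    )

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_ACC = 0.50          # hypothetical: every checkpoint is equally good
N_QUESTIONS = 200        # hypothetical benchmark size
N_CHECKPOINTS = 100      # the "100 checkpoints" from the comment

# Score reported after selecting the best checkpoint on the benchmark itself.
selected = best_of_checkpoints(TRUE_ACC, N_CHECKPOINTS, N_QUESTIONS, rng)

# Score of a single checkpoint on a fresh, never-reused set of questions.
fresh = measured_accuracy(TRUE_ACC, N_QUESTIONS, rng)

print(f"true accuracy:          {TRUE_ACC:.3f}")
print(f"best of {N_CHECKPOINTS} checkpoints: {selected:.3f}")
print(f"fresh held-out score:   {fresh:.3f}")
```

Under these assumptions the best-of-100 number lands above the true accuracy even though no benchmark question ever entered training, which is the sense in which the test set has turned into a validation set.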
I still believe o3 is the real deal, BUT this gimmick kind of sours my appetite a bit for those who run the company.