Comment by BoorishBears

3 days ago

The benchmarks also claim random 32B-parameter models beat Claude 4 at coding, so we know just how much they matter.

It should be obvious to anyone with even a cursory interest in model training that you can't trust benchmarks unless they're fully private black boxes.

If you can get even a hint of the shape of the questions on a benchmark, it's trivial to synthesize massive amounts of data that help you beat it. And given the nature of funding right now, you're almost silly not to do it: it's not cheating, it's "demonstrably improving your performance at the downstream task."
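
To make that concrete, here's a toy sketch of the idea. The templates, tasks, and inputs are all made up, and any real effort would use an LLM to paraphrase and solve at scale rather than string formatting, but it shows how little you need beyond a guess at the question shape:

```python
# Hypothetical sketch of "teaching to the shape of the test."
# No real benchmark is targeted; every template and slot value is invented.
import random

# Suppose a leaked example suggests the benchmark asks short, self-contained
# coding questions of the form "Write a function that <task> given <input>."
TEMPLATES = [
    "Write a Python function that {task} given {inp}.",
    "Implement {task} for {inp} and return the result.",
    "Given {inp}, write code that {task}.",
]
TASKS = ["reverses the words", "counts the vowels", "finds the longest run of digits"]
INPUTS = ["a string", "a list of strings", "a sentence of ASCII text"]

def synthesize(n: int, seed: int = 0) -> list[str]:
    """Generate n prompts that mimic the guessed question shape."""
    rng = random.Random(seed)
    return [
        rng.choice(TEMPLATES).format(task=rng.choice(TASKS), inp=rng.choice(INPUTS))
        for _ in range(n)
    ]

if __name__ == "__main__":
    # Even a crude generator like this yields thousands of benchmark-shaped
    # prompts; pair each with a model-generated solution, fine-tune, and the
    # score goes up without the model getting better in general.
    for prompt in synthesize(5):
        print(prompt)
```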