Comment by avbanks

5 months ago

I still find 3.5 Sonnet the best for my coding tasks (better than o1, o3-mini, and R1). The makers of the other models might be trying to game the system by fine-tuning them for the benchmarks.

Would love to know just how overfit a lot of them are to these benchmarks.