Comment by avbanks
1 year ago
I still find 3.5 Sonnet the best for my coding tasks (better than o1, o3-mini, and R1). Others might be trying to game the system and fine-tune their models for the benchmarks.
Would love to know just how overfit a lot of them are on these benchmarks.