Comment by minimaxir

1 month ago

> They are not "getting worse" they "have been bad".

The agents available in January 2025 were much, much worse than the agents available in November 2025.

5 comments

minimaxir

Yes, and for some cases no.

The models have gotten very good, but I'd rather have an obviously broken pile of crap that I can spot immediately than something that is deep-fried with RL to always succeed but has subtle problems that someone will LGTM :( I guess it's not much different with human-written code, but the models seem to have weirdly inhuman failures: you just skim some code because you can't believe anyone could get it wrong, and it turns out to be wrong.

minimaxir 1 month ago
That's what test cases are for, which is good for both humans and nonhumans.
- Snuggly73 1 month ago
  
  Test cases are great, but not a total solution. Can you write a test case for the add_numbers(a, b) function?
  
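To make the question concrete, here is a minimal Python sketch. The thread gives only the name and signature of `add_numbers(a, b)`, so the body below is a hypothetical implementation invented for illustration, with a deliberate subtle bug: it routes the addition through `float`. The obvious example-based test passes anyway, which is roughly the "not a total solution" point.

```python
def add_numbers(a, b):
    # Hypothetical implementation (not from the thread). It looks harmless,
    # but converting to float silently loses precision for large integers.
    return int(float(a) + float(b))


def test_add_numbers_basic():
    # The kind of test a reviewer would likely accept at a glance.
    assert add_numbers(2, 3) == 5
    assert add_numbers(-1, 1) == 0
    assert add_numbers(0, 0) == 0


test_add_numbers_basic()  # passes despite the bug

# An input the basic test never tries: 2**53 + 1 is not representable
# as a 64-bit float, so the result comes back off by one.
big = 2**53
print(add_numbers(big, 1) == big + 1)  # False
```

A property-based test or a wider range of inputs might catch this particular bug, but only if someone thinks to look there in the first place.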