Comment by aizk

4 hours ago

How do you guys manage regressions as a whole with every new model update? A massive test set of e2e problem solving to see how the models compare?

Remember when they shipped that version that didn't actually start or run? At work we were goofing on them a bit, until I said, "Wait, how did their tests even run on that?" And we realized that whatever their CI/CD process is, it wasn't running on the actual release binary at the time... I imagine their variation on how most engineers think about CI/CD is probably indicative of some other patterns (or a lack of traditional patterns).
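A minimal sketch of the kind of check that would have caught that: a CI smoke test that executes the exact artifact being shipped rather than a dev build. The path, the stand-in binary, and the expected output here are all made up for illustration; this is nothing like Anthropic's actual pipeline.

```shell
#!/bin/sh
set -eu

# Stand in for the packaged release artifact (hypothetical path).
# In a real pipeline this would be the binary produced by the release job.
mkdir -p dist
printf '#!/bin/sh\necho "app started"\n' > dist/app
chmod +x dist/app

# Smoke test: launch the exact artifact that ships and check it starts.
OUT="$(./dist/app)"
[ "$OUT" = "app started" ] || { echo "release binary failed to start" >&2; exit 1; }
echo "smoke test passed"
```

The point isn't the check itself, it's *what* gets checked: if CI only exercises the source tree or an intermediate build, a broken packaging step sails through green.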

As someone who used to work on Windows, I had a vision of a similarly scoped e2e testing harness, like the one for Windows Vista/7 (knowing about bugs/issues doesn't mean you can necessarily fix them... hence Vista, then 7), and I assumed Anthropic must provide some enterprise guarantee backed by this testing matrix I imagined must exist. Long way of saying: I think they might just YOLO regressions by constantly updating their testing/acceptance criteria.