
Comment by baq

8 days ago

typical experience when only using one foundation model TBH. results are much better if you let different models review each other's output.

the bottleneck now is testing. that isn't going away anytime soon; it'll get much worse for a while, since models are good at churning out code that's slightly wrong, or technically correct but solving a different problem than intended. it's going to be a relatively short-lived situation I'm afraid, until the industry switches to most code being written for serving agents instead of humans.

The way LLMs work, different tokens can activate different parts of the network. I generally have 2-3 different agents review the code from different perspectives. I give them identities, like Martin Fowler or Uncle Bob, or whoever I think is relevant.

  • true - but the way LLMs are trained, google's RLVR is different from anthropic's, which is different from openai's. you'll get very good results sending the same 'review this change' prompt (literally) to different models.
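
the fan-out described above can be sketched in a few lines. everything here is a hypothetical illustration: the reviewer callables (`stub_model_a`, `stub_model_b`) stand in for real model clients from different vendors, and the persona strings are just example system prompts, not any real API:

```python
# Sketch of the multi-model review idea: send the identical
# 'review this change' prompt to several reviewers, each configured
# with a different persona. The stub reviewers below stand in for
# calls to distinct foundation models.

REVIEW_PROMPT = "review this change:\n{diff}"

PERSONAS = {
    "fowler": "You are Martin Fowler. Focus on design and refactoring.",
    "uncle_bob": "You are Uncle Bob. Focus on clean code and naming.",
}

def fan_out_review(diff, reviewers):
    """Send the same prompt to every reviewer; collect answers by name."""
    prompt = REVIEW_PROMPT.format(diff=diff)
    return {
        name: reviewer(PERSONAS.get(name, ""), prompt)
        for name, reviewer in reviewers.items()
    }

# Hypothetical stand-ins for two different vendors' models.
def stub_model_a(system, prompt):
    return "[model-a] design looks fine"

def stub_model_b(system, prompt):
    return "[model-b] consider renaming `tmp`"

reviews = fan_out_review(
    "def f(tmp): return tmp + 1",
    {"fowler": stub_model_a, "uncle_bob": stub_model_b},
)
for name, review in reviews.items():
    print(f"{name}: {review}")
```

in practice you'd replace the stubs with real client calls and diff the resulting reviews; disagreements between models are often where the interesting bugs hide.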