Comment by umairnadeem123

10 hours ago

Definitely useful to show what models recommend in real use (rather than just meaningless benchmarks), but I still think small changes in prompt wording and repo setup can change the outcome quite a bit, so I'd love tighter controls there. IME, running Claude Code with Opus 4.6 on slightly different repo setups gives wildly different results. I also generally prefer to avoid NIH syndrome and use off-the-shelf libraries, and I specifically tell CC to do so, which influences its choices a lot.