Comment by elAhmo

1 hour ago

As if choosing a model were not hard enough on its own, offering six levels of "effort" (quite a vague term as well), namely low, medium, high, xhigh, max, ultracode (?!?!), makes comparisons next to impossible, because people using the same model can have vastly different experiences.

What exactly is the difference between high and xhigh? Or xhigh and max? This is definitely too granular, and it seems Anthropic took OpenAI's confusing model naming as inspiration.

It probably just limits the number of intermediate tokens one way or another. Almost certainly the impact on the result is close to zero.

Not only this, but hermetic checks on local machines for spot-testing new models are becoming increasingly difficult, if not impossible.

- We have zero visibility into what Anthropic does with our own prompts server-side (do they return cached results from similar queries? Do we develop our own hot paths?).

- Local memory files are written independently of the project directory and are acted on by the new models, even if old models wrote them.

- CLAUDE.md files vary in effectiveness, and different models (and effort levels) treat them differently.

- Our own git history "supports" newer models, i.e. if you have a larger body of work in git when you adopt a new model (like 4.8) than you had when you started from scratch with 4.6 or something, 4.8 may "appear" smarter when in fact you just have more evidence and signal about what you intend for a model to do.
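
For what it's worth, the three local confounders above (memory files, CLAUDE.md, git history) can at least be partly controlled for. Below is a rough sketch, not a real recipe: `make_hermetic_sandbox` and `CONTEXT_PATTERNS` are names I made up, and it assumes the tool under test finds its memory files via `HOME`, which I have not verified. It does nothing about the server-side question.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Assumption: these are the names that leak prior context into a run.
# Extend the tuple if your setup keeps context elsewhere.
CONTEXT_PATTERNS = (".git", "CLAUDE.md")


def make_hermetic_sandbox(project_dir: str) -> tuple[Path, dict[str, str]]:
    """Copy project_dir into a fresh temp dir, minus git history and CLAUDE.md
    files, and return (workdir, env) where env has HOME set to an empty dir."""
    root = Path(tempfile.mkdtemp(prefix="hermetic-"))
    workdir = root / "project"
    fake_home = root / "home"
    fake_home.mkdir()

    # ignore_patterns matches names at every directory level,
    # so nested CLAUDE.md files are skipped as well.
    shutil.copytree(
        project_dir,
        workdir,
        ignore=shutil.ignore_patterns(*CONTEXT_PATTERNS),
    )

    # Hypothetical isolation step: an empty HOME only helps if the tool
    # actually resolves its memory files relative to HOME.
    env = dict(os.environ)
    env["HOME"] = str(fake_home)
    return workdir, env
```

You would then launch each model/effort combination with `cwd=workdir` and `env=env` (e.g. via `subprocess.run`), building a fresh sandbox per run so nothing written by one model is visible to the next.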