Comment by elAhmo

1 hour ago

As if choosing a model were not hard enough on its own, offering six levels of "effort" (quite a vague term as well): low, medium, high, xhigh, max, ultracode (?!?!) makes comparisons next to impossible, since people using the same model can have vastly different experiences.

What exactly is the diff between high and xhigh? Or xhigh and max? This is definitely too granular and it seems Anthropic took OpenAI's confusion with models as inspiration.


Wowfunhappy 31 minutes ago

Something I found helpful: In this article, scroll down to the first big image, which is a graph labeled “Agentic coding performance by effort level”. https://www.anthropic.com/engineering/april-23-postmortem

This convinced me to just always set 4.7 to xhigh. Admittedly not sure about 4.8.

spacebacon 1 hour ago

They are doomed. Publishing small wins while they can.

https://open.substack.com/pub/sublius/p/srt-introspect-why-c...

thaanpaa 35 minutes ago

Probably limits the number of intermediate tokens one way or the other. Almost certainly the impact on the result is close to zero.

kkukshtel 44 minutes ago

Not only this, but hermetic checks on local machines for spot testing new models are becoming increasingly difficult, if not impossible.

- We have 0 visibility into what Anthropic does with our own prompts server-side (do they return cached results from similar queries? Do we develop our own hot paths?).

- Local memory files are written independently of the project directory and are acted on by the new models, even if old models wrote them.

- CLAUDE.md files have varying degrees of efficiency and different models (and effort) treat them differently

- Our own git history "supports" newer models, i.e. if you have a larger body of work in git when you adopt a new model (like 4.8) than you had when you started from scratch with 4.6 or something, 4.8 may "appear" smarter when in fact you just have more evidence and signal about what you intend for a model to do.