Comment by tgtweak
3 days ago
I think frontier models can do more with fewer tokens (and do the wrong thing far less often) than a "really fast" small model.
There are use cases for fast/ultrafast inference models — classifying text, scoring things, extracting information — but for coding and other knowledge tasks, you're not going to reach your solution faster at 16,000 tokens/s if the solution never comes (or is the wrong one).