Comment by robertkarl

14 hours ago

In this paper they nerf an LLM's ability to emit waffling thinking tokens like "wait", "but", and "alternatively". The models (old, small ones in the paper) then terminate reasoning sooner and perform better. I bet Anthropic is tuning this on their backend.

4 comments

addandsubtract 1 hour ago

Didn't they originally introduce those tokens to make the models smarter by second-guessing their "thoughts"?

orbital-decay 6 hours ago

I imagine Anthropic would rather train a small control model than resort to sampling hacks.

meatmanek 11 hours ago

This is super cool. Do you know if any of the inference backends (llama.cpp, vLLM, etc.) support this technique?

iaw 6 hours ago

vLLM supports "banning" certain tokens, but I don't know if it can dynamically reduce their probability.
To my knowledge you can also "ban" tokens with llama.cpp, but the list is passed in the API call rather than to the server at initialization.
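
For anyone wondering what the difference between "banning" and "dynamically reducing" looks like at the logit level, here is a rough sketch in plain Python. It is not the paper's method and not any backend's API: the toy vocabulary, the logits, and the `ramped_penalty` schedule are all made up for illustration. The only real idea is that a ban subtracts infinity from a token's logit, while a softer nerf subtracts a finite amount that can change as reasoning goes on.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def penalize(logits, token_ids, penalty):
    """Return a copy of `logits` with `penalty` subtracted at `token_ids`.

    penalty=math.inf is a hard ban; a finite value only makes the
    tokens less likely. Assumes at least one token stays unbanned.
    """
    adjusted = list(logits)
    for tid in token_ids:
        adjusted[tid] -= penalty
    return adjusted

def ramped_penalty(step, start, ramp, max_penalty):
    """Hypothetical schedule (my assumption, not from the paper):
    no penalty for the first `start` reasoning tokens, then a linear
    ramp that reaches `max_penalty` after `ramp` more tokens.
    """
    if step <= start:
        return 0.0
    return min(max_penalty, max_penalty * (step - start) / ramp)

# Toy vocabulary and made-up logits, for illustration only.
VOCAB = ["the", "wait", "but", "so", "</think>"]
LOGITS = [1.0, 2.0, 1.5, 0.5, 0.0]
WAFFLE_IDS = [VOCAB.index("wait"), VOCAB.index("but")]

if __name__ == "__main__":
    # Probability of "wait" early, mid-ramp, and at full penalty.
    for step in (0, 150, 300):
        p = ramped_penalty(step, start=100, ramp=100, max_penalty=4.0)
        probs = softmax(penalize(LOGITS, WAFFLE_IDS, p))
        print(step, p, round(probs[VOCAB.index("wait")], 3))
```

In a real backend you would need a hook that runs on the logits at every decoding step and knows how many reasoning tokens have been generated so far. A static per-request ban only covers the `math.inf` case.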