Comment by vlovich123

14 hours ago

Nope, there are no tricks unless there have been major architectural shifts I missed. The rot doesn’t come from inference tricks that try to bring down the quadratic complexity of attention. Task performance problems are generally a training problem: the longer the context, the fewer training examples of that length you have. So how do you train the model to behave well over long contexts? That’s where the tricks are. If I’m not mistaken, most of it relies on synthetically generated data, which explains the rot.

A quick Google search turns up terms such as "sparse attention", a family of techniques used to avoid quadratic runtime.

I don't know whether Anthropic has revealed such details, since AI research is becoming increasingly secretive, but architectural tricks of this kind definitely exist.
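
To make the "sparse attention" idea concrete, here is a minimal toy sketch (not any vendor's actual implementation) of sliding-window attention, one common sparse variant: each query attends only to the most recent `window` key positions, so the cost drops from O(n²) to O(n·window):

```python
import numpy as np

def sliding_window_attention(q, k, v, window=2):
    """Toy causal sliding-window attention.

    q, k, v: arrays of shape (n, d). Each position i attends only to
    positions [i - window + 1, i], so total work is O(n * window)
    rather than the O(n^2) of full attention.
    """
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)
        # Scaled dot-product scores over the local window only.
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ v[lo:i + 1]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out = sliding_window_attention(q, k, v, window=2)
```

Production systems (e.g. block-sparse or strided patterns) are far more elaborate, but the principle is the same: restrict which key/value pairs each query sees.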