Comment by singron
5 days ago
> I thought the whole point of transformers was that inference speed no longer depended on prompt length
That's not true at all; handling long prompts efficiently is exactly what prompt caching is for. For one, you can at least pre-populate the attention KV cache, and the cost of doing so scales with the prompt size. It is true that once your prompt exceeds the context size, additional length no longer affects inference speed, since the excess is essentially discarded.
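A toy sketch of the idea (not any particular serving library's API; the dimension, token names, and `prefill` helper are made up for illustration): prefill work is proportional to the number of prompt tokens whose K/V entries are not yet cached, so reusing a cached prefix skips most of that work.

    import numpy as np

    D = 64  # hypothetical head dimension

    def prefill(tokens, kv_cache):
        """Compute K/V only for tokens not already covered by the cache.
        The work done here scales with the number of *new* tokens."""
        new_tokens = tokens[len(kv_cache):]
        for tok in new_tokens:
            # Stand-ins for the real projections K = x @ W_k, V = x @ W_v.
            rng = np.random.default_rng(hash(tok) % (2**32))
            kv_cache.append((rng.standard_normal(D), rng.standard_normal(D)))
        return len(new_tokens)  # tokens actually processed

    # First request: the whole prompt must be prefilled.
    cache = []
    prompt = ["system"] * 1000 + ["question", "A"]
    print(prefill(prompt, cache))            # 1002 tokens of work

    # Second request shares the 1000-token prefix: with prompt caching,
    # only the new suffix needs prefill work.
    prompt2 = ["system"] * 1000 + ["question", "B"]
    print(prefill(prompt2, cache[:1000]))    # 2 tokens of work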