Comment by martinald
11 hours ago
Maybe I'm missing something in this paper, but this seems to me to be just pretty "standard" caching stuff, albeit:
a) very time sensitive b) huge files c) scoped per user
Sort of reminds me of video streaming on CDNs for live video (but per user)?
I still think the big win is going to come based on time of use/live capacity. In a pure economics sense you want to charge a lot for inference when it's oversubscribed and far less when it's off peak (see electricity markets).
We have seen this with Anthropic's peak times, but it's very blunt currently. We also saw this with batch processing back in the day, but that breaks down because agents are 'chatty' and need to send new responses ASAP. You can't wait ages for each response - it would take weeks to do a simple agentic task if you had to wait hours between turns.
So I think what we'll see is async agents queued up, that you can then decide when to run - either 'immediately' for time sensitive stuff (for more $$$) or 'best effort', where they can be scheduled to run whenever the provider wants (3am, say). If you also have diagnostics showing that agent task xyz usually takes y tokens total, you can schedule these far more efficiently. This also significantly reduces the amount of KV cache gymnastics, as you can dedicate that agent task to a certain rack and schedule it all efficiently.
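To make that concrete, here's a rough sketch of what an 'immediate vs. best effort' queue with per-task token estimates could look like. All the names, tiers, and numbers here are made up for illustration - real providers would have their own APIs and pricing:

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical priority tiers: lower value runs first.
IMMEDIATE, BEST_EFFORT = 0, 1

@dataclass(order=True)
class AgentTask:
    priority: int                          # only field used for heap ordering
    est_tokens: int = field(compare=False) # historical estimate for this task type
    name: str = field(compare=False)

def schedule(tasks, offpeak_budget_tokens):
    """Greedy pass: run all immediate tasks right away, then pack
    best-effort tasks into the remaining off-peak token budget."""
    heap = list(tasks)
    heapq.heapify(heap)
    run_now, deferred = [], []
    while heap:
        t = heapq.heappop(heap)
        if t.priority == IMMEDIATE:
            run_now.append(t.name)
        elif t.est_tokens <= offpeak_budget_tokens:
            offpeak_budget_tokens -= t.est_tokens
            run_now.append(t.name)
        else:
            deferred.append(t.name)  # wait for the next off-peak window
    return run_now, deferred

tasks = [
    AgentTask(BEST_EFFORT, 80_000, "refactor"),
    AgentTask(IMMEDIATE, 5_000, "hotfix"),
    AgentTask(BEST_EFFORT, 30_000, "docs"),
]
run, deferred = schedule(tasks, offpeak_budget_tokens=150_000)
```

The point isn't the queue itself - it's that once you know a task's rough total token cost up front, bin-packing it into whatever capacity is idle at 3am becomes trivial.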
tl;dr I think the issues with inference efficiency need to be solved at a higher abstraction level of per agent "task" not purely on a per chat message basis. If you can schedule a load of agentic use cases off peak you don't need to preempt them because there is spare capacity by nature.