Comment by rcxdude

11 hours ago

It's hard to speed up running a single prompt through a model because decoding is a sequential, memory-bandwidth-limited process: you roughly need to stream all of the weights in the model from memory to produce the next token before starting again, and a GPU can perform many more arithmetic operations than a single weight application in between memory fetches. So with current hardware it's much more efficient to run multiple prompts in parallel against the same weights.
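A back-of-envelope calculation makes this concrete. The numbers below (model size, bandwidth, FLOP rate) are illustrative assumptions, roughly an H100-class GPU running a 70B-parameter model in fp16, not specs from the comment:

```python
# Why single-stream decoding is memory-bandwidth bound, not compute bound.
# All hardware/model numbers here are illustrative assumptions.

weights_bytes = 70e9 * 2          # 70B params at 2 bytes each (fp16) = 140 GB
mem_bandwidth = 3.35e12           # ~3.35 TB/s HBM bandwidth (assumed)
peak_flops = 1e15                 # ~1 PFLOP/s fp16 compute (assumed)

# Each decoded token must stream every weight from memory once,
# and does roughly 2 FLOPs (multiply + add) per weight.
time_memory = weights_bytes / mem_bandwidth      # time to read the weights
time_compute = (2 * 70e9) / peak_flops           # time to do the arithmetic

print(f"memory-limited:  {1 / time_memory:.0f} tokens/s per stream")
print(f"compute-limited: {1 / time_compute:.0f} tokens/s per stream")
```

With these assumed numbers, the compute limit is a couple of orders of magnitude above the memory limit, which is why a batch of N prompts can share a single pass over the weights and decode roughly N tokens in the time one prompt decodes one.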

Also, the limiting factor on a single instance of an agent is generally how much of its context window gets filled up, rather than how much time it has to 'think'. Model performance generally decreases as the context grows (it more or less gets dumber the more it has to think about), so agent frameworks try to mitigate this by summarizing the work of one instance and passing the summary into another instance with a fresh context. This means that if you have five tasks that are each going to fill up a model's addressable context, there's no real benefit to running them sequentially unless they naturally feed into each other.
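A toy sketch of that "summarize and hand off" pattern. `call_model` and the message-count budget are hypothetical stand-ins, not any specific framework's API:

```python
# Toy sketch of the context-handoff pattern: when the window nears its
# budget, compress the work so far and restart with a fresh context.
# `call_model` and CONTEXT_BUDGET are illustrative assumptions.

CONTEXT_BUDGET = 6  # pretend the window holds only 6 messages

def call_model(prompt, context):
    # Placeholder for a real LLM API call; returns a fake reply.
    return f"step result ({len(context)} msgs in context)"

def run_task(task, steps):
    context = [task]
    handoffs = 0
    for step in range(steps):
        if len(context) >= CONTEXT_BUDGET:
            # Window nearly full: summarize and spawn a fresh context,
            # carrying only the task and the summary forward.
            summary = f"summary of {len(context)} messages"
            context = [task, summary]
            handoffs += 1
        context.append(call_model(f"step {step}", context))
    return handoffs

print(run_task("refactor module", steps=20))  # number of fresh-context handoffs
```

The summary here is a placeholder string; in a real framework it would itself be produced by a model call, trading a little fidelity for a mostly empty window.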