Comment by bee_rider

11 hours ago

Would it be possible to give that single agent process twice as much compute, instead? Or do production systems not scale that way?

It's hard to speed up running a single prompt through a model because decoding is a sequential, memory-bandwidth-limited process: to produce each token you roughly need to stream all of the model's weights through the GPU before starting again, and a GPU can do far more arithmetic operations than a single weight application between memory fetches. So with current hardware it's a lot more efficient to run multiple prompts in parallel against the same weights.
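A back-of-envelope sketch of why this is bandwidth-bound (the model size, precision, and bandwidth figures below are illustrative assumptions, not measurements):

```python
# If every decode step streams all weights once regardless of how many
# sequences are in the batch, total tokens/sec scales with batch size
# (until compute or KV-cache traffic becomes the limit).

def decode_tokens_per_sec(param_count, bytes_per_param, bandwidth_gb_s, batch_size=1):
    bytes_per_step = param_count * bytes_per_param   # weights read per step
    steps_per_sec = bandwidth_gb_s * 1e9 / bytes_per_step
    return steps_per_sec * batch_size

# Hypothetical 70B-parameter model in fp16 on a GPU with ~2 TB/s bandwidth:
for b in (1, 8):
    print(f"batch {b}: ~{decode_tokens_per_sec(70e9, 2, 2000, batch_size=b):.0f} tok/s total")
```

A single stream gets roughly 14 tok/s under these assumptions, while a batch of 8 gets roughly 8x the aggregate throughput for the same weight traffic, which is the whole argument for batching.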

Also, the limiting factor on a single instance of an agent is generally how much of its context window gets filled up, rather than how much time it has to 'think'. Model performance generally decreases as the context grows (it more or less gets dumber the more it has to keep track of), so agent frameworks try to mitigate this by summarizing the work of one instance and passing the summary into another instance with a fresh context. This means that if you have five tasks that will each fill up a model's context, there's no real benefit to running them sequentially unless they naturally feed into each other.
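A minimal sketch of that summarize-and-hand-off pattern. `call_model` stands in for a real LLM API (here it's a stub so the control flow is runnable), and the token budget and chars-per-token heuristic are made-up numbers:

```python
CONTEXT_LIMIT = 8000  # hypothetical token budget

def call_model(prompt: str) -> str:
    return f"<response to {len(prompt)} chars of prompt>"  # stub for a real API call

def rough_tokens(text: str) -> int:
    return len(text) // 4  # crude chars-to-tokens heuristic

def run_agent(task: str, steps: list[str]) -> str:
    context = task
    for step in steps:
        if rough_tokens(context) + rough_tokens(step) > CONTEXT_LIMIT:
            # Context nearly full: compress progress so far and continue
            # in a "fresh" context seeded only with the summary.
            summary = call_model("Summarize progress:\n" + context)
            context = f"Task: {task}\nProgress so far: {summary}"
        context += "\n" + step + "\n" + call_model(context)
    return context
```

Each hand-off trades detail for headroom, which is why chaining instances helps with long tasks but does nothing for independent ones.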

You can reduce the latency of a single prompt by running it at a smaller batch size (dedicating more of the hardware to one request), but in practice the speedup is non-linear and you pay for it in total throughput. It probably isn't worth it unless your model provider makes it really easy.
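A toy model of that trade-off, assuming a bandwidth-bound decoder where streaming the weights costs the same per step regardless of batch size and each extra sequence adds only a small overhead (all numbers are assumptions for illustration):

```python
def per_token_latency_ms(batch_size, weight_stream_ms=70.0, per_seq_overhead_ms=2.0):
    # One decode step: fixed cost to stream the weights, plus a small
    # per-sequence cost (attention over that sequence's KV cache, etc.).
    return weight_stream_ms + per_seq_overhead_ms * batch_size

for b in (1, 4, 16):
    step = per_token_latency_ms(b)
    print(f"batch {b:>2}: {step:.0f} ms/step, {b / step * 1000:.0f} tok/s total")
```

Under these assumptions, dropping from batch 16 to batch 1 shaves only ~30% off per-token latency while giving up ~90% of aggregate throughput, which is the non-linear scaling mentioned above.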