Comment by c7b
21 days ago
Thanks for building what I'd hoped to find the time to build (and much better than what I would have made)! One question: do you think there is room for parallelization here, e.g. in the retry loop? Local models can generally handle a limited number (~2 digits) of concurrent requests pretty well, even on consumer hardware, which can give >10x boosts in effective tokens/s. I've been thinking for a while about workflows that could take advantage of this, and 'fix this error' could be one application, if not an ideal one. Would be curious what you think.
Interesting - so you're thinking of giving the model two parallel shots at the tool call and taking the winner if there is one, or falling back to a retry if not?
That would certainly work in theory, but I'm not as familiar with parallel calls.
- If you mean the model calls the tool twice, identically, in a batch call - that would work fine and Forge handles batch calls, but many small models wouldn't think to do that so you'd have to explicitly prompt it to do so.
- If you mean ask the LLM twice to call the tool and look at both answers, my only concern would be latency from doing 2 calls instead of 1.
- Unless you're truly running 2 instances of the model and aren't memory-bandwidth bound, then yes running parallel workflows would likely help. Especially if you could have them compare notes at certain steps or something.
But I haven't explored this much at all so if you're thinking of something else, let me know!
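For concreteness, here is a minimal sketch of the "two parallel shots, take the winner, otherwise fall back to retrying" idea. This is not how Forge does it; `attempt`, `first_success`, and `call_with_fallback` are hypothetical names, and `attempt` stands in for whatever coroutine asks the model to make the tool call and raises when the call is unusable.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical stand-in: one "ask the model to make the tool call" attempt.
# It returns the tool result, or raises if the call was unusable.
Attempt = Callable[[], Awaitable[str]]


async def first_success(attempt: Attempt, shots: int = 2) -> Optional[str]:
    """Run `shots` attempts concurrently and return the first that succeeds."""
    tasks = [asyncio.ensure_future(attempt()) for _ in range(shots)]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception:
                continue  # this shot failed; wait for the next one
        return None  # every parallel shot failed
    finally:
        # Cancel any shots still running and collect all outcomes quietly.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def call_with_fallback(attempt: Attempt, shots: int = 2, retries: int = 3) -> str:
    """Try parallel shots first, then fall back to plain sequential retries."""
    winner = await first_success(attempt, shots)
    if winner is not None:
        return winner
    last_error: Optional[Exception] = None
    for _ in range(retries):
        try:
            return await attempt()
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all attempts failed") from last_error
```

Whether this beats a plain retry loop depends on the latency question above: it only pays off if the backend can serve the concurrent shots without slowing each one down much.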
So as an oversimplified PoC, I get:
and four answers (3 of which are immediately usable). With -ns 1 I get:
Now, this is probably not the right way to use it: you should probably use vLLM instead, and it's not a good model for this either. But there is a real effect here that others have demonstrated: the GPU is apparently not always maxed out while handling a single request, so sending concurrent requests can yield substantial parallelization benefits. The idea for this application would be something like this: send off the same query in parallel requests, triggering parallel tool calls, and then filter the results (drop all the failing ones, rank the rest by some simple metric of code complexity). There are probably better applications as well; I'm basically just thinking about what kinds of tasks could benefit from parallelization.
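Roughly, the send-in-parallel, filter, then rank idea could look like the sketch below. Everything here is an assumption for illustration: `generate` is a hypothetical coroutine that sends one request to the local server and returns candidate code, "failing" is reduced to "does not parse as Python", and the complexity metric is just a count of AST nodes.

```python
import ast
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical stand-in: sends one prompt to the model, returns candidate code.
Generate = Callable[[str], Awaitable[str]]


def passes(source: str) -> bool:
    """Placeholder check: the candidate must at least parse as Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def complexity(source: str) -> int:
    """Crude complexity proxy: number of nodes in the parsed AST."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


async def fan_out(generate: Generate, prompt: str, n: int = 4) -> List[str]:
    """Send the same prompt n times concurrently, keep passing answers, simplest first."""
    results = await asyncio.gather(
        *(generate(prompt) for _ in range(n)),
        return_exceptions=True,  # a failed request should not sink the others
    )
    candidates = [r for r in results if isinstance(r, str) and passes(r)]
    return sorted(candidates, key=complexity)
```

In a real setup `passes` would more likely run the tool call or a test suite, and any throughput gain depends on the server actually batching the concurrent requests.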