Comment by c7b

21 days ago

Thanks for building what I'd hoped to find the time to build (and much better than what I would have made)! One question: do you think there is room for parallelization here, e.g. in the retry loop? Local models can generally handle a limited number (roughly double digits) of concurrent requests pretty well, even on consumer hardware, which can give >10x boosts in effective tokens/s. I've been thinking for a while about workflows that could take advantage of this, and 'fix this error' could be one (if not ideal) application. I'd be curious what you think.

zambelli 21 days ago

Interesting - so you're thinking of giving the model two parallel shots at the tool call and taking the winner if there is one, or falling back to a retry if not?

That would certainly work in theory, but I'm not as familiar with parallel calls.

- If you mean the model calls the tool twice, identically, in a batch call - that would work fine and Forge handles batch calls, but many small models wouldn't think to do that, so you'd have to explicitly prompt them to do so.

- If you mean asking the LLM twice to call the tool and looking at both answers, my only concern would be latency from doing 2 calls instead of 1.

- If you're truly running 2 instances of the model and aren't memory-bandwidth bound, that latency concern goes away, and then yes, running parallel workflows would likely help. Especially if you could have them compare notes at certain steps or something.

But I haven't explored this much at all so if you're thinking of something else, let me know!
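
For illustration, the "parallel shots, then fall back to retry" idea could be sketched roughly as below. This is only a sketch under assumptions: `attempt` is a hypothetical coroutine standing in for one model request plus tool call (returning `None` on failure), and `parallel_then_retry` is an illustrative name, not part of Forge.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical: one attempt = one model request + tool call.
# It returns the result on success, or None on failure.
Attempt = Callable[[], Awaitable[Optional[str]]]


async def parallel_then_retry(
    attempt: Attempt, shots: int = 2, max_rounds: int = 3
) -> Optional[str]:
    """Run `shots` attempts concurrently; return the first success.

    If every attempt in a round fails, start another round (the retry),
    up to `max_rounds` rounds in total.
    """
    for _ in range(max_rounds):
        tasks = [asyncio.create_task(attempt()) for _ in range(shots)]
        try:
            # Handle results in completion order so a fast winner is used
            # without waiting for slower attempts.
            for finished in asyncio.as_completed(tasks):
                try:
                    result = await finished
                except Exception:
                    continue  # treat an exception as a failed attempt
                if result is not None:
                    return result
        finally:
            # Cancel attempts that are still running; no-op for finished ones.
            for task in tasks:
                task.cancel()
    return None
```

Whether this beats a plain sequential retry depends on the latency point above: it only pays off if the backend can serve the concurrent requests without slowing each one down much.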

c7b 21 days ago

So as an oversimplified PoC, I get:

    llama-parallel -m ~/models/Qwen3.5-4B-Q8_0.gguf -ns 4 -p "Fix this Python code, answer with code only: prnt('Hello World)" -pps

    llama_perf_context_print:        load time =    1181.90 ms
    llama_perf_context_print: prompt eval time =     190.57 ms /   374 tokens (    0.51 ms per token,  1962.49 tokens per second)
    llama_perf_context_print:        eval time =    3612.25 ms /   159 runs   (   22.72 ms per token,    44.02 tokens per second)
    llama_perf_context_print:       total time =    4302.84 ms /   533 tokens
    llama_perf_context_print:    graphs reused =        155

and four answers (3 of which are immediately usable). With -ns 1 I get:

    llama_perf_context_print:        load time =    1185.61 ms
    llama_perf_context_print: prompt eval time =     187.55 ms /   305 tokens (    0.61 ms per token,  1626.27 tokens per second)
    llama_perf_context_print:        eval time =     158.92 ms /     7 runs   (   22.70 ms per token,    44.05 tokens per second)
    llama_perf_context_print:       total time =     468.85 ms /   312 tokens
    llama_perf_context_print:    graphs reused =          6

Now, this is probably not the right way to use it: you should probably use vLLM instead, and this isn't a good model for the task either. But there is a real effect here that others have demonstrated: the GPU is apparently not always maxed out while handling a single request, so sending concurrent requests can yield substantial parallelization benefits. The idea for this application would be something like: send the same query as several parallel requests, triggering parallel tool calls, then filter the results (drop all failing ones and rank the rest by some simple metric of code complexity). There are probably better applications as well; I'm basically just thinking about what kinds of tasks could benefit from parallelization.
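
A rough sketch of that filter-and-rank step, under stated assumptions: `generate` is a hypothetical blocking function that sends one request to whatever local server is in use and returns the model's code answer; "failing" is approximated here by "does not parse as Python"; and the complexity metric is simply the number of AST nodes. A real setup would more likely run the candidates or their tests.

```python
import ast
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def is_valid(code: str) -> bool:
    """Stand-in for 'does not fail': the candidate parses as Python."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def complexity(code: str) -> int:
    """Crude complexity proxy: number of nodes in the AST."""
    return sum(1 for _ in ast.walk(ast.parse(code)))


def best_of_n(generate: Callable[[str], str], prompt: str, n: int = 4) -> List[str]:
    """Send the same prompt n times concurrently, drop invalid answers,
    and return the rest ordered from simplest to most complex."""
    # Threads only overlap the waiting; the actual parallelism is assumed
    # to come from the server batching the concurrent requests.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, [prompt] * n))
    valid = [c for c in candidates if is_valid(c)]
    return sorted(valid, key=complexity)
```

With the example prompt above, `best_of_n(generate, "Fix this Python code, ...", n=4)[0]` would pick the simplest candidate that parses, if any candidate does.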