Comment by c7b

21 days ago

So as an oversimplified PoC, I get:

    llama-parallel -m ~/models/Qwen3.5-4B-Q8_0.gguf -ns 4 -p "Fix this Python code, answer with code only: prnt('Hello World)" -pps

    llama_perf_context_print:        load time =    1181.90 ms
    llama_perf_context_print: prompt eval time =     190.57 ms /   374 tokens (    0.51 ms per token,  1962.49 tokens per second)
    llama_perf_context_print:        eval time =    3612.25 ms /   159 runs   (   22.72 ms per token,    44.02 tokens per second)
    llama_perf_context_print:       total time =    4302.84 ms /   533 tokens
    llama_perf_context_print:    graphs reused =        155

and four answers (three of which are immediately usable). With -ns 1 I get:

    llama_perf_context_print:        load time =    1185.61 ms
    llama_perf_context_print: prompt eval time =     187.55 ms /   305 tokens (    0.61 ms per token,  1626.27 tokens per second)
    llama_perf_context_print:        eval time =     158.92 ms /     7 runs   (   22.70 ms per token,    44.05 tokens per second)
    llama_perf_context_print:       total time =     468.85 ms /   312 tokens
    llama_perf_context_print:    graphs reused =          6

Now, this is probably not the right way to use it. You should probably use vLLM instead, and this isn't a good model for the task either. But there is a real effect here that others have demonstrated: the GPU is apparently not always maxed out while handling a single request, so sending concurrent requests can yield substantial parallelization benefits. The idea for this application would be something like: send the same query off in parallel requests, triggering parallel tool calls, and then filter the results (drop all the failing ones, rank the rest by some simple metric of code complexity). There are probably better applications as well; I'm basically just thinking about what kinds of tasks could benefit from parallelization.
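To make the fan-out-and-filter idea a bit more concrete, here is a rough Python sketch. Everything in it is an assumption on my part: `generate` is a hypothetical stand-in for whatever sends one request to the model, "failing" is approximated as "does not parse" (a real check would probably run the code or its tests), and AST node count is just one possible simple complexity metric.

```python
import ast
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def fan_out(generate: Callable[[str], str], prompt: str, n: int) -> List[str]:
    """Send the same prompt n times concurrently and collect the answers.

    `generate` is a hypothetical callable wrapping one model request.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(generate, [prompt] * n))


def complexity(source: str) -> Optional[int]:
    """Return the AST node count, or None if the code does not parse."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return None
    return sum(1 for _ in ast.walk(tree))


def pick_candidates(answers: List[str]) -> List[str]:
    """Drop answers that fail to parse and rank the rest, simplest first."""
    scored = [(complexity(a), a) for a in answers]
    valid = [(score, a) for score, a in scored if score is not None]
    return [a for _, a in sorted(valid, key=lambda pair: pair[0])]


if __name__ == "__main__":
    # Canned answers standing in for real model output.
    answers = [
        "prnt('Hello World)",  # still broken, gets filtered out
        "import sys\nsys.stdout.write('Hello World\\n')",
        "print('Hello World')",
    ]
    for candidate in pick_candidates(answers):
        print(repr(candidate))
```

Note that a parse check alone would still accept something like `prnt('Hello World')`, which only fails at runtime, so this filter is the cheapest possible first pass rather than a real correctness test.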
